Skip to content

[OCPCLOUD-1107] External cloud-provider support via FeatureGate in post-install#2386

Merged
openshift-merge-robot merged 2 commits intoopenshift:masterfrom
Danil-Grigorev:external-provider-support
Jun 24, 2021
Merged

[OCPCLOUD-1107] External cloud-provider support via FeatureGate in post-install#2386
openshift-merge-robot merged 2 commits intoopenshift:masterfrom
Danil-Grigorev:external-provider-support

Conversation

@Danil-Grigorev
Copy link
Copy Markdown

@Danil-Grigorev Danil-Grigorev commented Feb 2, 2021

- What I did

Implementing external cloud provider selection forced by feature gate. Described in the proposal: https://github.com/openshift/enhancements/pull/463/files#diff-3e0e2c48e70215076dfe36c13768a823ab7080d929d80292f37db2ef5a2121e8R201

This PR is main blocker on other work with integrating CCM in Openshift: https://issues.redhat.com/browse/OCPCLOUD-1107

- How to verify it

- Description for the changelog

@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 2, 2021
@ashcrow ashcrow requested review from yuqi-zhang and removed request for ashcrow February 2, 2021 21:26
@Danil-Grigorev Danil-Grigorev force-pushed the external-provider-support branch from 20321ce to ac4d0dd Compare February 2, 2021 21:28
@openshift-ci-robot openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 2, 2021
@kikisdeliveryservice kikisdeliveryservice changed the title External cloud-provider support via FeatureGate in post-install [WIP] External cloud-provider support via FeatureGate in post-install Feb 2, 2021
@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 2, 2021
@Danil-Grigorev Danil-Grigorev force-pushed the external-provider-support branch from ac4d0dd to f05c4d0 Compare February 2, 2021 21:41
Copy link
Copy Markdown
Contributor

@kikisdeliveryservice kikisdeliveryservice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

feel like the team/arch also needs to review the enhancement

/hold

Comment thread pkg/controller/kubelet-config/kubelet_config_controller.go Outdated
@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 2, 2021
@kikisdeliveryservice
Copy link
Copy Markdown
Contributor

Feels like this should leverage: #2352 via: https://github.com/openshift/api/blob/master/config/v1/types_feature.go#L121

@Danil-Grigorev Danil-Grigorev force-pushed the external-provider-support branch from f05c4d0 to 62fc109 Compare February 3, 2021 14:47
Comment thread pkg/controller/template/render.go Outdated
Comment on lines 340 to 342
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this not need to be in the switch so that we can enable the external provider platform by platform?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eventually, yes. In TechPreviewNoUpgrade it is on feature gate to set invariant configuration when found. I don't think anyone will risk setting such gate on production cluster for platform we don't have configs for. This just means less changes in development cycle, if we don't make it a switch.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had thought we would want this to only affect things on platforms we have allowed it to, hence being provider specific, let's discuss on the call next week

@Danil-Grigorev Danil-Grigorev force-pushed the external-provider-support branch from e4be2f8 to 655727b Compare February 4, 2021 20:55
Copy link
Copy Markdown
Contributor

@kikisdeliveryservice kikisdeliveryservice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Danil-Grigorev I've gone through the enhancements, but am having problems connecting the dots here.

Could you add some detail/specific description as to what this PR is trying to accomplish, whether this PR is like step 1 of X etc... if this is day2 post-install is this user initiated? etc? Will there be any followup prs or are there prereq prs for this? etc

Thanks! :)

@Danil-Grigorev
Copy link
Copy Markdown
Author

@kikisdeliveryservice Sure. The prerequisite PR will be an API change for FeatureGate in openshift/api#738 The name of the featureGate, it's location and related changes in functionality may be corrected based on new data from kubernetes/enhancements#2443. What it will result into - we will have a way to deploy a techPreview clusters with out-of-tree providers code, and attempt migration from in-tree in post install time, while identifying issues (CI use-case). It is user initiated.

Ideal scenario - follow up PRs will only remove selections from platform switch case, say: AWS is now by default external in 1.24. Realistically, we will still have to force some kubelet flags even after the upstream feature gate is implemented, as the goal is to identify issues in migration, which may break some storage, networking and cloud-credentials functionality in the cluster if done blindly.

Actual change, which we will probably need - implement our selection (or more realistically wait for some helper upstream implementation) in MCO install phase, where the bootstrap methods are called. This way a fresh cluster will get out-of-tree providers by default with related DisableCloudProviders feature gate . This is out-of-scope for this PR.

@Danil-Grigorev Danil-Grigorev force-pushed the external-provider-support branch from b042fdd to d36cbbb Compare February 11, 2021 13:27
@Danil-Grigorev Danil-Grigorev changed the title [WIP] External cloud-provider support via FeatureGate in post-install External cloud-provider support via FeatureGate in post-install Feb 11, 2021
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 11, 2021
@kikisdeliveryservice
Copy link
Copy Markdown
Contributor

Thanks for the summary @Danil-Grigorev !

@Danil-Grigorev
Copy link
Copy Markdown
Author

/retest

@Danil-Grigorev Danil-Grigorev force-pushed the external-provider-support branch from 8fd0155 to f96cd2f Compare June 17, 2021 15:04
- Add external-cloud-provider feature gate into render
- Add exclusion list to featureGate selection: a list of openshift specific
featureGates, which will not be passed to default kubelet config
@Danil-Grigorev
Copy link
Copy Markdown
Author

@JoelSpeed @kikisdeliveryservice All comments addressed. I checked, permission change happens on revendor, and it seems expected. Please review.

@JoelSpeed
Copy link
Copy Markdown
Contributor

checked, permission change happens on revendor, and it seems expected. Please review.

Perhaps since this seems unrelated to the contents of this PR, should we just drop these two files from the commit? Would make the history cleaner and then the vendor discrepancy can be sorted at a later point?

Copy link
Copy Markdown
Contributor

@JoelSpeed JoelSpeed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

Thanks @Danil-Grigorev

@elmiko
Copy link
Copy Markdown
Contributor

elmiko commented Jun 21, 2021

just wanted to note that i have been test driving this patch for the last few days and it seems to be working for me. i have not given a thorough testing of the negative case (eg when the gate is not applied), but i have done a basic smoke test on that.

@kikisdeliveryservice
Copy link
Copy Markdown
Contributor

/retest

@elmiko
Copy link
Copy Markdown
Contributor

elmiko commented Jun 22, 2021

@Danil-Grigorev , i talked with @kikisdeliveryservice in slack and it sounds like we could help here by adding a must-gather for a cluster that has had the feature gate enabled as a day 2 operation. so, basically:

  1. create normal cluster with in-tree providers
  2. enable feature gate and switch cluster to out-of-tree
  3. once cluster is healthy, collect must-gather

i think we are already capturing this behavior in the openshift-e2e-aws-ccm step. but since we need this PR to make it work, none of those jobs have been passing. @Danil-Grigorev does that make sense?

@kikisdeliveryservice
Copy link
Copy Markdown
Contributor

just to follow-on the above, would this ever be disabled day2 @Danil-Grigorev ?

@Danil-Grigorev
Copy link
Copy Markdown
Author

@kikisdeliveryservice No, applying the feature gate would move cluster to TechPreview state, where we officially do not support rollback. Yet the code does not prevent it from happening and works both ways.

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Jun 22, 2021

@Danil-Grigorev: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/okd-e2e-aws 39e07de link /test okd-e2e-aws
ci/prow/e2e-aws-disruptive 39e07de link /test e2e-aws-disruptive

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@elmiko
Copy link
Copy Markdown
Contributor

elmiko commented Jun 23, 2021

quick update here, we have shared a must-gather with @kikisdeliveryservice

Copy link
Copy Markdown
Contributor

@kikisdeliveryservice kikisdeliveryservice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed must gather and verified:

we started off with --cloud-provider=aws
and
successfully switched to --cloud-provider=external

Based on my understanding from: https://github.com/openshift/enhancements/pull/463/files#diff-3e0e2c48e70215076dfe36c13768a823ab7080d929d80292f37db2ef5a2121e8R270 also saw: ExternalCloudProvider: true alongside --cloud-provider=aws which helps force upgrade from intree to out of tree.

Since this has also been extensively tested by that team, let's merge this to unblock work and move forward.

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Jun 23, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Danil-Grigorev, JoelSpeed, kikisdeliveryservice, rphillips

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [kikisdeliveryservice]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 23, 2021
@kikisdeliveryservice
Copy link
Copy Markdown
Contributor

/hold cancel

@openshift-ci openshift-ci Bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 23, 2021
@sinnykumari
Copy link
Copy Markdown
Contributor

/test e2e-agnostic-upgrade

@openshift-merge-robot openshift-merge-robot merged commit 7de6ba3 into openshift:master Jun 24, 2021
@Danil-Grigorev Danil-Grigorev deleted the external-provider-support branch June 24, 2021 12:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants