Skip to content

Bug 1947684: delay kubelet config readiness until after pools and controller config are ready#2517

Merged
openshift-merge-robot merged 1 commit intoopenshift:masterfrom
harche:bump_maxtries
Apr 13, 2021
Merged

Bug 1947684: delay kubelet config readiness until after pools and controller config are ready#2517
openshift-merge-robot merged 1 commit intoopenshift:masterfrom
harche:bump_maxtries

Conversation

@harche
Copy link
Copy Markdown
Contributor

@harche harche commented Apr 8, 2021

Signed-off-by: Harshal Patil harpatil@redhat.com

- What I did

- How to verify it

- Description for the changelog

@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 8, 2021
@rphillips
Copy link
Copy Markdown
Contributor

/retitle Bug 1947684: delay kubelet config readiness until after pools and controller config are ready
/retest

@openshift-ci-robot openshift-ci-robot changed the title WIP: bump up maxRetries to 18 in kubelet config controller Bug 1947684: delay kubelet config readiness until after pools and controller config are ready Apr 9, 2021
@openshift-ci-robot openshift-ci-robot added bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. and removed do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Apr 9, 2021
@openshift-ci-robot
Copy link
Copy Markdown
Contributor

@harche: This pull request references Bugzilla bug 1947684, which is invalid:

  • expected the bug to target the "4.8.0" release, but it targets "---" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

Details

In response to this:

Bug 1947684: delay kubelet config readiness until after pools and controller config are ready

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. label Apr 9, 2021
@rphillips
Copy link
Copy Markdown
Contributor

/bugzilla refresh

@openshift-ci-robot openshift-ci-robot added bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. and removed bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. labels Apr 9, 2021
@openshift-ci-robot
Copy link
Copy Markdown
Contributor

@rphillips: This pull request references Bugzilla bug 1947684, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.8.0) matches configured target release for branch (4.8.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

No GitHub users were found matching the public email listed for the QA contact in Bugzilla (schoudha@redhat.com), skipping review request.

Details

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@rphillips
Copy link
Copy Markdown
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 9, 2021
@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Apr 9, 2021
@rphillips
Copy link
Copy Markdown
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 9, 2021
Copy link
Copy Markdown
Contributor

@sinnykumari sinnykumari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
Test needs to pass first and would like to see e2e-gcp-op passes.

@rphillips
Copy link
Copy Markdown
Contributor

@harche Can you add this to the PR:

diff --git a/pkg/controller/kubelet-config/kubelet_config_controller_test.go b/pkg/controller/kubelet-config/kubelet_config_controller_test.go
index f5c0c960..2bb29378 100644
--- a/pkg/controller/kubelet-config/kubelet_config_controller_test.go
+++ b/pkg/controller/kubelet-config/kubelet_config_controller_test.go
@@ -11,6 +11,7 @@ import (
        osev1 "github.com/openshift/api/config/v1"
        oseconfigfake "github.com/openshift/client-go/config/clientset/versioned/fake"
        oseinformersv1 "github.com/openshift/client-go/config/informers/externalversions"
+       corev1 "k8s.io/api/core/v1"
        "k8s.io/apimachinery/pkg/api/equality"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/runtime"
@@ -111,6 +112,25 @@ func newControllerConfig(name string, platform osev1.PlatformType) *mcfgv1.Contr
                                },
                        },
                },
+               Status: mcfgv1.ControllerConfigStatus{
+                       Conditions: []mcfgv1.ControllerConfigStatusCondition{
+                               {
+                                       Type:    mcfgv1.TemplateControllerCompleted,
+                                       Status:  corev1.ConditionTrue,
+                                       Message: "",
+                               },
+                               {
+                                       Type:    mcfgv1.TemplateControllerRunning,
+                                       Status:  corev1.ConditionFalse,
+                                       Message: "",
+                               },
+                               {
+                                       Type:    mcfgv1.TemplateControllerFailing,
+                                       Status:  corev1.ConditionFalse,
+                                       Message: "",
+                               },
+                       },
+               },
        }
        return cc
 }

@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Apr 9, 2021
@rphillips
Copy link
Copy Markdown
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 9, 2021
Copy link
Copy Markdown
Contributor

@yuqi-zhang yuqi-zhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we gating the kubeletconfig on controllerconfig, since I don't think they depend on each other at all? Is this basically trying to wait until the initial configuration for a cluster has settled before applying master kubeletconfigs? In that case wouldn't this cause 1 extra reboot for masters during install, because the kubeletconfig sometimes doesn't get applied until initial configuration is generated, which is somewhat counter to what we want? (maybe I'm mis-understanding something here)

I feel like for the bug we should fix it in a way such that the bootstrap already handles kubeletconfigs, such that the initial pool configuration has the kubeletconfig in place.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you update the commit message such that you mention why this is being bumped

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The kubelet config is dependent on the controller config here: https://github.com/harche/machine-config-operator/blob/f28b2c2a4102228af4b6754522e13384711969e7/pkg/controller/kubelet-config/kubelet_config_controller.go#L321-L341

Ideally, initial rollout would be nice, but that isn't how the feature works today. Changes to kubelet configurations require a reboot.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah my bad. I missed that connection. I think this code does make sense to make the wait explicit. that said, I'm not sure if this will achieve the desired effect of waiting for the initial configuration of the pool to settle (the linked bugzilla), since

return ctrl.syncStatusOnly(cfg, err, "could not get kubelet config key: %v", err)

was blocking it before now anyways. The race between the KCC generating MCs from kubeletconfigs and the in-cluster MCC generating the initial node configuration still exists I think.

If we want to go this route, maybe

we should instead check if the pool has a rendered config set to the pool's spec. Then the kubeletconfig should always get generated after an initial rendered configs gets generated (we want that, since it won't contain kubeletconfigs like bootstrap), and thus bypass the "not found" issue.

Alternatively, I still think it might be a cleaner solution just to add a bootstrap reader mode to the kubeletconfigcontoller and have the bootstrap MCC call it (achieving day 1), which if we want to do in the future, we'd have to revert our proposed fix here,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, we need to update the code to run at bootstrap. Perhaps @QiWang19 can look into updating the KubeletConfig as a second pass. I would like to unblock Omer.

The node_controller does something similar to see if the pool is ready...
https://github.com/harche/machine-config-operator/blob/5527f842eda1fd51538590a592d3c95a3cd679d9/pkg/controller/node/node_controller.go#L720-L727

Signed-off-by: Ryan Phillips <rphillip@redhat.com>
@openshift-ci-robot openshift-ci-robot removed the lgtm Indicates that a PR is ready to be merged. label Apr 9, 2021
@rphillips
Copy link
Copy Markdown
Contributor

/retest

Copy link
Copy Markdown
Contributor

@yuqi-zhang yuqi-zhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving since this seems to fix the bootstrap race condition for kubeletconfigs.

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 13, 2021
@sinnykumari
Copy link
Copy Markdown
Contributor

/lgtm
/retest

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 13, 2021
@openshift-ci-robot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: harche, rphillips, sinnykumari, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [sinnykumari,yuqi-zhang]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Apr 13, 2021

@harche: The following test failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/okd-e2e-aws 5527f84 link /test okd-e2e-aws

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit 3fe1270 into openshift:master Apr 13, 2021
@openshift-ci-robot
Copy link
Copy Markdown
Contributor

@harche: All pull requests linked via external trackers have merged:

Bugzilla bug 1947684 has been moved to the MODIFIED state.

Details

In response to this:

Bug 1947684: delay kubelet config readiness until after pools and controller config are ready

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants