Bug 1920807: [vsphere] set hostname with --static to provide consistent node name for CSR approval #2380
|
/cherrypick release-4.6 |
|
@rvanderp3: once the present PR merges, I will cherry-pick it on top of release-4.6 in a new PR and assign it to you.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
@rvanderp3: This pull request references Bugzilla bug 1920807, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug. |
|
/assign @jcpowermac |
|
/test e2e-vsphere |
|
(need to make sure the actual platform affected by the PR has a ci run ^^^) |
|
@rvanderp3 can you please update the commit to indicate this is a vsphere pr? /hold |
Of course. Doing it now. |
Force-pushed from 4817163 to 823fe3a.
The effect of this will be to set both the transient and the static hostname, which will override any DHCP hostnames.
The transient hostname would change if the user removed the hostname from vSphere. In this case, both the transient hostname and the static hostname will be updated to whatever vSphere tells us. For other platforms, we've explicitly used --transient so that all the forms of dynamic hostname discovery continue to work.
I would not recommend backporting this change.
The issue we are seeing in 4.6.9 and later is that if a hostname is provided by DHCP, that hostname is preferred over the one provided by the machine-api/vsphere configuration, because NetworkManager starts after the hostname is set by the vsphere-hostname service. As a result, the machine-api fails to reconcile the node because the hostname does not match what is expected, and scaling machinesets fails.
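To make the ordering problem concrete, here is a small model of which name ends up on the node. It is only a sketch: it does not call hostnamectl or NetworkManager, the function name and all hostnames are made up for illustration, and it assumes NetworkManager's default behaviour of leaving the hostname alone when a static hostname is already set.

```shell
# Simplified model only: nothing here touches the real hostname.
# Assumption: a non-empty static hostname wins; otherwise a DHCP-provided
# name replaces whatever transient name was set earlier in boot.
effective_hostname() {
  static_name="$1"     # static hostname, may be empty
  transient_name="$2"  # name set early in boot, e.g. from guestinfo
  dhcp_name="$3"       # name offered by DHCP, may be empty

  if [ -n "$static_name" ]; then
    printf '%s\n' "$static_name"
  elif [ -n "$dhcp_name" ]; then
    printf '%s\n' "$dhcp_name"
  else
    printf '%s\n' "$transient_name"
  fi
}

# Transient-only (the failing case described above): the DHCP name wins.
effective_hostname "" "mycluster-abc12-worker-xyz" "dhcp-host-42"

# Static name set from guestinfo: the machine name sticks.
effective_hostname "mycluster-abc12-worker-xyz" "mycluster-abc12-worker-xyz" "dhcp-host-42"
```

In the first call the node would come up with the DHCP name, which is the mismatch the machine-api then fails to reconcile.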
@darkmuggle Your explanation makes a lot of sense. I just verified that this issue can be addressed by changing the vsphere-hostname service to start after node-valid-hostname.service and NetworkManager.service. The --static flag was removed as part of this test as well. Would that be a preferred path to addressing this issue?
I spoke a little too soon on that. It does address the issue, but there is a race condition depending on when NetworkManager sets the hostname, which results in the DHCP hostname being applied some of the time.
In general, we want to use DHCP over the VSphere hostname. The easier change would be to change https://github.com/openshift/machine-config-operator/blob/master/templates/common/vsphere/units/vsphere-hostname.service.yaml#L8 to After
@rvanderp3 and I collaborated via Slack on a change that gets around the whole hostname dance.
Thanks! I'm testing things out now.
Unfortunately, the change did not seem to resolve the issue. CSRs still had to be manually approved in order for a machineset-created node to join the cluster. The vSphere cloud provider is retrieving the node name from the hostname (vsphere.go#L314). The hostname-override does seem to update the hostname in the v1.Node resource, but does not have an impact on the node name reported by the kubelet. The kube doc suggests the hostname-override flag may not be effective if there is a configured cloud provider, and I think that may be what we are hitting here.
You are indeed right @rvanderp3. And after a deep dive on the Cloud provider plugin, the kubelet and the golang os package -- sadly we will need to set the static name.
I pinged over a solution that I think should be an effective compromise.
Just tested the fix on my cluster where I could reproduce the problem and things look good. CSRs are auto-approving as expected on a scale up.
|
/test e2e-vsphere |
Force-pushed from 823fe3a to 546631b.
|
/test e2e-vsphere |
|
/hold |
Force-pushed from 546631b to 0434df5.
|
@rvanderp3: This pull request references Bugzilla bug 1920807, which is invalid. |
|
/hold cancel |
|
/bugzilla refresh |
|
@rvanderp3: This pull request references Bugzilla bug 1920807, which is valid. 3 validation(s) were run on this bug. |
|
I'll defer to @darkmuggle and @jcpowermac for reviews |
|
/test e2e-vsphere-upi |
@rvanderp3 and I talked yesterday about this PR. I would like to see the results of UPI though |
We should be okay. The hostname needs to match the VM name for the certificate approval for the Kubelet and changing the hostname after the fact usually makes the kubelet unhappy. |
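As a rough illustration of the matching requirement mentioned above, the check below compares a VM name with the hostname a node reports. It is a sketch, not the approver's actual logic, which is not shown in this thread; the function name and sample names are hypothetical, and lower-casing both sides is an assumption.

```shell
# Illustrative check only; the real CSR approval logic is not reproduced here.
# Assumption: names are compared after lower-casing both sides.
names_match() {
  vm_name=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  node_name=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
  [ -n "$vm_name" ] && [ "$vm_name" = "$node_name" ]
}

# Hypothetical names: the VM name versus a hostname handed out by DHCP.
if names_match "mycluster-abc12-worker-xyz" "dhcp-host-42"; then
  echo "match: the node name lines up with the VM name"
else
  echo "mismatch: expect the CSR to stay pending"
fi
```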
darkmuggle
left a comment
/lgtm
@rvanderp3 thank you to you and your team for the patience in working through this.
|
/retest Please review the full test history for this PR and help us cut down flakes. |
|
/lgtm On slack with @rvanderp3 tested with UPI, no need to wait for CI. |
|
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: darkmuggle, jcpowermac, rvanderp3 |
|
/retest Please review the full test history for this PR and help us cut down flakes. |
|
@rvanderp3: The following tests failed, say
|
/retest Please review the full test history for this PR and help us cut down flakes. |
|
@rvanderp3: All pull requests linked via external trackers have merged: Bugzilla bug 1920807 has been moved to the MODIFIED state. |
|
@rvanderp3: new pull request created: #2404 |
|
/cherrypick release-4.7 |
|
@rvanderp3: new pull request created: #2405 |
Fixes: BZ1920807 - when creating a new machine, the node will get an unexpected hostname
- What I did
Added the --static flag to hostnamectl to prevent DHCP from overriding the hostname sourced from guestinfo. This is only done once on vSphere nodes, on the first boot.
- How to verify it
<cluster>-<clusterid>-worker-<name>
- Description for the changelog
Resolves an issue where the hostname provided by DHCP overrides the guestinfo hostname, preventing the machine-api-controller from reconciling new nodes that receive a hostname from DHCP.
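For readers who want to see the shape of the change, the following is a minimal sketch of setting the static hostname from a guestinfo value. It is not the script from this PR: the function name, the DRY_RUN switch, and the sample machine name are hypothetical, and the vmtoolsd query in the closing comment is an assumption about how the guestinfo value might be read.

```shell
# Sketch only. DRY_RUN defaults to 1, so nothing is changed unless asked.
set_static_hostname() {
  name="$1"
  if [ -z "$name" ]; then
    echo "no guestinfo hostname found; leaving hostname unchanged" >&2
    return 1
  fi
  if [ "${DRY_RUN:-1}" = "1" ]; then
    # Print the command instead of running it.
    printf 'hostnamectl set-hostname --static %s\n' "$name"
  else
    hostnamectl set-hostname --static "$name"
  fi
}

# Dry run with a hypothetical machine name:
set_static_hostname "mycluster-abc12-worker-xyz"

# On a real vSphere guest one might feed it from guestinfo (assumed key):
#   set_static_hostname "$(vmtoolsd --cmd 'info-get guestinfo.hostname')"
```

Refusing to act on an empty value is a deliberate choice in this sketch, so that a missing guestinfo entry leaves the existing hostname alone.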