Skip to content

Bug 1920807: [vsphere] set hostname with --static to provide consistent node name for CSR approval#2380

Merged
openshift-merge-robot merged 1 commit intoopenshift:masterfrom
rvanderp3:vsphere-hostname-lost
Feb 11, 2021
Merged

Bug 1920807: [vsphere] set hostname with --static to provide consistent node name for CSR approval#2380
openshift-merge-robot merged 1 commit intoopenshift:masterfrom
rvanderp3:vsphere-hostname-lost

Conversation

@rvanderp3
Copy link
Copy Markdown
Contributor

@rvanderp3 rvanderp3 commented Feb 1, 2021

Fixes: BZ1920807 - when creating a new machine, the node will get an unexpected hostname

- What I did
Added the --static flag to hostnamectl to prevent DHCP from overriding the hostname sourced from guestinfo. This is only done once on vsphere nodes on the first boot.

- How to verify it

  • Configure DHCP to assign a hostname as part of the lease
  • Scale a new node
  • The node should automatically join the cluster with a node name like <cluster>-<clusterid>-worker-<name>

- Description for the changelog
Resolves issue where the hostname provided by the DHCP overrides the guestinfo hostname thus preventing the machine-api-controller from reconciling new nodes which receive a hostname from DHCP.

@rvanderp3 rvanderp3 changed the title set hostname with --static to prevent DHCP from overriding hostname p… Fixes: BZ1920807 - when creating a new machine, the node will get an unexpected hostname Feb 1, 2021
@rvanderp3
Copy link
Copy Markdown
Contributor Author

/cherrypick release-4.6

@openshift-cherrypick-robot
Copy link
Copy Markdown

@rvanderp3: once the present PR merges, I will cherry-pick it on top of release-4.6 in a new PR and assign it to you.

Details

In response to this:

/cherrypick release-4.6

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@darkmuggle darkmuggle changed the title Fixes: BZ1920807 - when creating a new machine, the node will get an unexpected hostname Bug 1920807: when creating a new machine, the node will get an unexpected hostname Feb 1, 2021
@openshift-ci-robot
Copy link
Copy Markdown
Contributor

@rvanderp3: This pull request references Bugzilla bug 1920807, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.7.0) matches configured target release for branch (4.7.0)
  • bug is in the state NEW, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)
Details

In response to this:

Bug 1920807: when creating a new machine, the node will get an unexpected hostname

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added bugzilla/severity-medium Referenced Bugzilla bug's severity is medium for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Feb 1, 2021
@kikisdeliveryservice kikisdeliveryservice changed the title Bug 1920807: when creating a new machine, the node will get an unexpected hostname Bug 1920807: set hostname with --static to prevent DHCP from overriding hostname property Feb 1, 2021
@kikisdeliveryservice kikisdeliveryservice requested review from kikisdeliveryservice and removed request for cgwalters February 1, 2021 18:45
@kikisdeliveryservice
Copy link
Copy Markdown
Contributor

/assign @jcpowermac

@kikisdeliveryservice
Copy link
Copy Markdown
Contributor

/test e2e-vsphere

@kikisdeliveryservice
Copy link
Copy Markdown
Contributor

(need to make sure the actual platform affected by the PR has a ci run ^^^)

@kikisdeliveryservice kikisdeliveryservice changed the title Bug 1920807: set hostname with --static to prevent DHCP from overriding hostname property Bug 1920807: [vsphere] set hostname with --static to prevent DHCP from overriding hostname property Feb 1, 2021
@kikisdeliveryservice
Copy link
Copy Markdown
Contributor

@rvanderp3 can you please update the commit to indicate this is a vsphere pr?

/hold

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 1, 2021
@rvanderp3
Copy link
Copy Markdown
Contributor Author

@rvanderp3 can you please update the commit to indicate this is a vsphere pr?

/hold

ofcourse. doing it now.

@rvanderp3 rvanderp3 force-pushed the vsphere-hostname-lost branch from 4817163 to 823fe3a Compare February 1, 2021 18:54
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The effect of this will be to set both the transient and the static hostname which will override and DHCP hostnames.

The transient hostname would change if the user removed the hostname from VSphere. In this case, both the transient hostname and the static hostname will be updated to whatever VSphere tells us. For other platforms, we've explicitly used --transient so all the forms of dynamic hostname discovery continues to work.

I would not recommend backporting this change.

Copy link
Copy Markdown
Contributor Author

@rvanderp3 rvanderp3 Feb 1, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue we are seeing in 4.6.9 and later is if there is a hostname provided by DHCP, that hostname will be preferred over the one provided by the machine-api/vsphere configuration as network manager starts after the hostname is set by the vsphere-hostname service. As a result, the machine-api will fail to reconcile the node as the hostname does not match what is expected and scaling machinesets fails.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@darkmuggle Your explanation makes a lot of sense. I just verified that this issue can be addressed by changing the vsphere-hostname service to start after node-valid-hostname.service and NetworkManager.service. The --static flag was removed as part of this test as well. Would that be a preferred path to addressing this issue?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I spoke a little too soon on that, it does address the issue, but there is a race condition depending on when NetworkManager sets the hostname which results in the DHCP hostname being applied some of the time.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, we want to use DHCP over the VSphere hostname. The easier change would be to change https://github.com/openshift/machine-config-operator/blob/master/templates/common/vsphere/units/vsphere-hostname.service.yaml#L8 to After

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rvanderp3 collaborated via Slack that gets around the whole hostname dance.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I'm testing things out now.

Copy link
Copy Markdown
Contributor Author

@rvanderp3 rvanderp3 Feb 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, the change did not seem to resolve the issue. CSRs still had to be manually approved in order for an machineset created node to join the cluster. The vSphere cloud provider is retrieving the node name from the from the hostnamevsphere.go#L314. The hostname-override does seem to update the hostname in the v1.Node resource, but does not have an impact on the node name reported by the kubelet. The kube doc suggests the hostname-override flag may not be effective if there is a configured cloud provider and I think that may be what we are hitting here.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are indeed right @rvanderp3. And after a deep dive on the Cloud provider plugin, the kubelet and the golang os package -- sadly we will need to set the static name.

I pinged over a solution that I think should be an effective compromise.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just tested the fix on my cluster where I could reproduce the problem and things look good. CSRs are auto-approving as expected on a scale up.

@rvanderp3
Copy link
Copy Markdown
Contributor Author

/test e2e-vsphere

@rvanderp3 rvanderp3 force-pushed the vsphere-hostname-lost branch from 823fe3a to 546631b Compare February 1, 2021 22:44
@rvanderp3
Copy link
Copy Markdown
Contributor Author

/test e2e-vsphere

@rvanderp3 rvanderp3 changed the title Bug 1920807: [vsphere] set hostname with --static to prevent DHCP from overriding hostname property Bug 1920807: [vsphere] Change vsphere-hostname service to start after the node-valid-hostname service Feb 1, 2021
@kikisdeliveryservice kikisdeliveryservice removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 1, 2021
@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 2, 2021
@rvanderp3
Copy link
Copy Markdown
Contributor Author

/hold

@rvanderp3 rvanderp3 force-pushed the vsphere-hostname-lost branch from 546631b to 0434df5 Compare February 4, 2021 17:18
@openshift-ci-robot
Copy link
Copy Markdown
Contributor

@rvanderp3: This pull request references Bugzilla bug 1920807, which is invalid:

  • expected the bug to target the "4.8.0" release, but it targets "4.7.0" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

Details

In response to this:

[WIP] Bug 1920807: [vsphere] set hostname with --static to provide consistent node name for CSR approval

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot added the bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. label Feb 10, 2021
@rvanderp3
Copy link
Copy Markdown
Contributor Author

/hold cancel

@rvanderp3 rvanderp3 changed the title [WIP] Bug 1920807: [vsphere] set hostname with --static to provide consistent node name for CSR approval Bug 1920807: [vsphere] set hostname with --static to provide consistent node name for CSR approval Feb 10, 2021
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 10, 2021
@rvanderp3
Copy link
Copy Markdown
Contributor Author

/bugzilla refresh

@openshift-ci-robot openshift-ci-robot added the bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. label Feb 10, 2021
@openshift-ci-robot
Copy link
Copy Markdown
Contributor

@rvanderp3: This pull request references Bugzilla bug 1920807, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.8.0) matches configured target release for branch (4.8.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)
Details

In response to this:

/bugzilla refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot openshift-ci-robot removed the bugzilla/invalid-bug Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting. label Feb 10, 2021
@kikisdeliveryservice
Copy link
Copy Markdown
Contributor

I'll defer to @darkmuggle and @jcpowermac for reviews

@jcpowermac
Copy link
Copy Markdown
Contributor

/test e2e-vsphere-upi

@jcpowermac
Copy link
Copy Markdown
Contributor

I'll defer to @darkmuggle and @jcpowermac for reviews

@rvanderp3 and I talked yesterday about this PR.
I hope the 'on first boot' doesn't cause a problem. Good to see vmtoolds as required and the additional check of virtualization provider.

I would like to see the results of UPI though

@darkmuggle
Copy link
Copy Markdown

I'll defer to @darkmuggle and @jcpowermac for reviews

@rvanderp3 and I talked yesterday about this PR.
I hope the 'on first boot' doesn't cause a problem. Good to see vmtoolds as required and the additional check of virtualization provider.

We should be okay. The hostname needs to match the VM name for the certificate approval for the Kubelet and changing the hostname after the fact usually makes the kubelet unhappy.

Copy link
Copy Markdown

@darkmuggle darkmuggle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@rvanderp3 thank you to you and your team for the patience in working through this.

@openshift-ci-robot openshift-ci-robot added lgtm Indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Feb 10, 2021
@openshift-bot
Copy link
Copy Markdown
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@jcpowermac
Copy link
Copy Markdown
Contributor

/lgtm

On slack with @rvanderp3 tested with UPI, no need to wait for CI.

@openshift-ci-robot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: darkmuggle, jcpowermac, rvanderp3

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented Feb 11, 2021

@rvanderp3: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/e2e-ovn-step-registry 12333a3 link /test e2e-ovn-step-registry
ci/prow/okd-e2e-aws 12333a3 link /test okd-e2e-aws
ci/prow/e2e-vsphere-upi 12333a3 link /test e2e-vsphere-upi
ci/prow/e2e-aws-workers-rhel7 12333a3 link /test e2e-aws-workers-rhel7

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit 08cfd38 into openshift:master Feb 11, 2021
@openshift-ci-robot
Copy link
Copy Markdown
Contributor

@rvanderp3: All pull requests linked via external trackers have merged:

Bugzilla bug 1920807 has been moved to the MODIFIED state.

Details

In response to this:

Bug 1920807: [vsphere] set hostname with --static to provide consistent node name for CSR approval

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-cherrypick-robot
Copy link
Copy Markdown

@rvanderp3: new pull request created: #2404

Details

In response to this:

/cherrypick release-4.6

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@rvanderp3
Copy link
Copy Markdown
Contributor Author

/cherrypick release-4.7

@openshift-cherrypick-robot
Copy link
Copy Markdown

@rvanderp3: new pull request created: #2405

Details

In response to this:

/cherrypick release-4.7

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/severity-medium Referenced Bugzilla bug's severity is medium for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants