Skip to content

templates: Set vSphere hostname from guestinfo before NM starts#2282

Merged
openshift-merge-robot merged 1 commit intoopenshift:masterfrom
LorbusChris:patch-1
Dec 5, 2020
Merged

templates: Set vSphere hostname from guestinfo before NM starts#2282
openshift-merge-robot merged 1 commit intoopenshift:masterfrom
LorbusChris:patch-1

Conversation

@LorbusChris
Copy link
Copy Markdown
Contributor

@LorbusChris LorbusChris commented Dec 3, 2020

In Fedora 33, a change was introduced to set the hostname to "fedora" as a fallback instead of the default "localhost", so the current script does not work in OKD as NM will assign the fallback hostname.

This change makes it so the hostname is set to vSphere's
guestinfo.hostname if that's defined, before NetworkManager is started.

Copy link
Copy Markdown
Member

@cgwalters cgwalters left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks sane to me, though it'd be a bit more precise to conditionalize the fedora check to only if the OS ID in /usr/lib/os-release is fedora too.

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 3, 2020
@LorbusChris
Copy link
Copy Markdown
Contributor Author

I wonder if we still need this given this was fixed for the reporter with a newer version of NM: okd-project/okd#394 -- but on second thought, as this seems like a fallback solution anyway and in order to make it work we actually do need it.

I'll add the os-release check here :)

@kikisdeliveryservice
Copy link
Copy Markdown
Contributor

vsphere team should sign off on this:

/assign @jcpowermac @patrickdillon

Copy link
Copy Markdown

@darkmuggle darkmuggle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/hold

"I wonder if we still need this given this was fixed for the reporter with a newer version of NM"

If the report says it was fixed in a newer version of NM, then I think we should not ship this.

The same objection over magic strings apply -- this may seem "safe" but I would much rather universally set the transient name from the meta-data and then use NM over any magic string checking.

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 3, 2020
@fortinj66
Copy link
Copy Markdown
Contributor

fortinj66 commented Dec 3, 2020

/hold

"I wonder if we still need this given this was fixed for the reporter with a newer version of NM"

If the report says it was fixed in a newer version of NM, then I think we should not ship this.

The same objection over magic strings apply -- this may seem "safe" but I would much rather universally set the transient name from the meta-data and then use NM over any magic string checking.

The NM fix does not fix the issue I am seeing... That fix is for DHCP resolved hostnames. This is for vSphere resolved hostnames and fails when 'fedora' is the hostname. Checking for fedora is no better or worst than checking for localhost unless a better way is determined to fix this.

@darkmuggle
Copy link
Copy Markdown

darkmuggle commented Dec 3, 2020

Checking for fedora is no better or worst than checking for localhost unless a better way is determined to fix this.

The better way is to:

  1. always set the transient name directly from the meta-data (no check on localhost or Fedora
  2. Let NM come up and do its thing. Since the transient name is set directly from the VSphere meta-data then the hostname will always be set.

Fedora or localhost are both magic strings, and the fine Systemd folks have informed us that we need to stop checking magic strings. I've ranted on this in other bugs, but the problem is that we keep working around distro bugs, instead of fixing it in a better place. Fedora/OKD is supposed to be the test bed for RHCOS/OCP, so lets do it in a better place.

See #2217

@fortinj66
Copy link
Copy Markdown
Contributor

fortinj66 commented Dec 3, 2020

so rather than have the test

 hostname=$(hostname -s)
    if [ "${hostname}" = "localhost" ] || [ "${hostname}" = "fedora" ]; then
      if guestinfo_hostname=$(/bin/vmtoolsd --cmd 'info-get guestinfo.hostname'); then
          /usr/bin/hostnamectl --transient --static set-hostname ${guestinfo_hostname}
      fi
    fi

have only

vm_name=$(/bin/vmtoolsd --cmd 'info-get guestinfo.hostname')
 /usr/bin/hostnamectl --transient --static set-hostname ${vm_name}

or maybe without --static

which always forces it to the vm_name
and then change it again if NM gets an updated hostname from "somewhere" ?

@fortinj66
Copy link
Copy Markdown
Contributor

I noticed the guestinfo.hostname attribute is created by the openshift installer when it creates the vm ( or by the openshift api when it creates worker nodes) so there is a good chance that non IPI hosts will not have this entry...

@vrutkovs
Copy link
Copy Markdown
Contributor

vrutkovs commented Dec 3, 2020

there is a good chance that non IPI hosts will not have this entry

UPI installs are expected to have hostname pre-set via kernel args / NM config / DHCP / PTR records / guestinfo.hostname. IPI is relying on guestinfo, so it should be sufficient.

@darkmuggle
Copy link
Copy Markdown

darkmuggle commented Dec 3, 2020

I would use

vm_name=$(/bin/vmtoolsd --cmd 'info-get guestinfo.hostname')
/usr/bin/hostnamectl --transient set-hostname ${vm_name}

If you use --static then NM will not update the hostname at all. --static would be hard NACK this late in the 4.7 cycle since it would change hostname discovery.

Also [1] will need to be updated so that the unit starts before node-valid-hostname.service and Network-Manager.service. Looking at [2] indicates that the current configuration is prone to a racy condition where NetworkManager, node-valid-hostname.service and the current vsphere-hostname.service could each alter the name.

[1] https://github.com/openshift/machine-config-operator/blob/f949f7efb3b550b5762b513bf5e77bcbff1bc8d6/templates/common/vsphere/units/vsphere-hostname.service.yaml#L7
[2] https://src.fedoraproject.org/rpms/open-vm-tools/blob/HEAD/f/vmtoolsd.service

@fortinj66
Copy link
Copy Markdown
Contributor

fortinj66 commented Dec 3, 2020

UPI installs are expected to have hostname pre-set via kernel args / NM config / DHCP / PTR records / guestinfo.hostname. IPI is relying on guestinfo, so it should be sufficient.

So really we could/should do something like:

if vm_name=$(/bin/vmtoolsd --cmd 'info-get guestinfo.hostname'); then
    /usr/bin/hostnamectl --transient set-hostname ${vm_name}
fi

just in case guestinfo.hostname doesn't exist

@LorbusChris
Copy link
Copy Markdown
Contributor Author

@darkmuggle please take another look

@LorbusChris
Copy link
Copy Markdown
Contributor Author

/test e2e-vsphere

@LorbusChris LorbusChris changed the title templates: Update vsphere-hostname.sh to account for "fedora" fallback hostname templates: Set vSphere hostname from guestinfo before NM starts Dec 4, 2020
This change makes it so the hostname is set to vSphere's
guestinfo.hostname if that's defined, before NetworkManager is started.
@LorbusChris
Copy link
Copy Markdown
Contributor Author

/test e2e-vsphere

@kikisdeliveryservice
Copy link
Copy Markdown
Contributor

@LorbusChris et al, as a reminder the MCO doesn't own this template, so you will need to make sure a vsphere team member takes a look and LGTMs this. I've pinged in their slack channel but might require further followup..

@jcpowermac
Copy link
Copy Markdown
Contributor

LGTM, will let @patrickdillon review as well.
We should also test UPI since this change will effect that as well.

/test e2e-vsphere-upi

Can/should this be change to use afterburn instead? This was a last minute addition to get vSphere IPI working. I think the original intention was to eventually move to afterburn if it would work for this purpose.

@vrutkovs
Copy link
Copy Markdown
Contributor

vrutkovs commented Dec 4, 2020

Can/should this be change to use afterburn instead?

I think we can experiment with afterburn in okd-machine-os repo (and later propagate that to MCO)

@fortinj66
Copy link
Copy Markdown
Contributor

I ran/tested the script on my hostname broken cluster and it works fine. Cluster generated with proper hostnames after reboot.

hostnamectl shows the following:

[core@dev-c1v4-7gwcb-master-1 ~]$ hostnamectl
   Static hostname: n/a
Transient hostname: dev-c1v4-7gwcb-master-1
         Icon name: computer-vm
           Chassis: vm
        Machine ID: 6ebb0be1ece646748fd16e9c674d831b
           Boot ID: 917c5b56375e44f79e8a692ea79239b4
    Virtualization: vmware
  Operating System: Fedora CoreOS 33.20201202.10.1
       CPE OS Name: cpe:/o:fedoraproject:fedora:33
            Kernel: Linux 5.9.11-200.fc33.x86_64
      Architecture: x86-64

No static hostname is defined which is expected (I think)

@jcpowermac
Copy link
Copy Markdown
Contributor

UPI installed correctly. Docker rate limiting caused the test failures.

@patrickdillon
Copy link
Copy Markdown
Contributor

patrickdillon commented Dec 4, 2020

LGTM, will let @patrickdillon review as well.

LGTM as well

@darkmuggle
Copy link
Copy Markdown

/unhold

The clean-up looks good from the RHCOS perspective.

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Dec 4, 2020
@LorbusChris
Copy link
Copy Markdown
Contributor Author

/test e2e-aws-workers-rhel7
/test e2e-aws-serial
/test e2e-vsphere-upi

@kikisdeliveryservice
Copy link
Copy Markdown
Contributor

Ok given the consensus here I'll just do the final lgtm so this will make it in

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Dec 4, 2020
@openshift-ci-robot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cgwalters, kikisdeliveryservice, LorbusChris

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [cgwalters,kikisdeliveryservice]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

6 similar comments
@openshift-bot
Copy link
Copy Markdown
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Copy Markdown
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit 193839a into openshift:master Dec 5, 2020
@vrutkovs
Copy link
Copy Markdown
Contributor

vrutkovs commented Dec 6, 2020

/cherrypick release-4.6

@openshift-cherrypick-robot
Copy link
Copy Markdown

@vrutkovs: new pull request created: #2289

Details

In response to this:

/cherrypick release-4.6

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.