feat(vmware): Support network events #6063

Merged
TheRealFalcon merged 1 commit into canonical:main from akutz:feature/ds-vmw-reconfig-net on Mar 18, 2025

Conversation

@akutz
Contributor

@akutz akutz commented Feb 28, 2025

Proposed Commit Message

feat(vmware): Support network events

This patch updates the DataSource for VMware to support network
reconfiguration when the BOOT, BOOT_NEW_INSTANCE, and HOTPLUG
events are received.

Previously the datasource could only reconfigure the network if
a new instance ID was detected. However, due to features like
backup/restore, migrating VMs, etc., it was determined that it is
valuable to support reconfiguring the network without changing the
instance ID. This is because changing the instance ID also means
running the per-instance configuration modules, such as regenerating
the system's SSH host keys, which could lock out automation.

Additional Context

NA

Test Steps

make clean_pyc && PYTHONPATH="$(pwd)" python3 -m pytest -v tests/unittests/sources/test_vmware.py
make clean_pyc && PYTHONPATH="$(pwd)" python3 -m pytest -v tests/unittests/test_upgrade.py

Merge type

  • Squash merge using "Proposed Commit Message"
  • Rebase and merge unique commits. Requires commit messages per-commit each referencing the pull request number (#<PR_NUM>)

Fixes #5729

@akutz
Contributor Author

akutz commented Feb 28, 2025

cc @PengpengSun

@akutz akutz force-pushed the feature/ds-vmw-reconfig-net branch 7 times, most recently from ae2d751 to 429877d Compare February 28, 2025 18:08
@github-actions github-actions Bot added the documentation label Feb 28, 2025
@akutz akutz force-pushed the feature/ds-vmw-reconfig-net branch from 429877d to e9917c1 Compare February 28, 2025 22:06
@PengpengSun
Contributor

PengpengSun commented Mar 3, 2025

Hi @akutz
I think this will address cloud-init issue #5729. We are good when the transport is guestinfo, since metadata can be updated before HotPlug. But with the imc transport, I need to figure out how HotPlug works with live customization; usually a hardware change happens before guest customization.

@akutz
Contributor Author

akutz commented Mar 3, 2025

Hi @akutz I think this will address this cloud-init issue #5729, we are good when transport is guestinfo, metadata can be updated before HotPlug. But with imc transport, I need figure out how HotPlug work with live customization, usually a hardware change is before guest customization.

Do you think I should remove HOTPLUG if the transport is IMC then?

@akutz akutz force-pushed the feature/ds-vmw-reconfig-net branch from e9917c1 to 7351b91 Compare March 3, 2025 13:55
@akutz
Contributor Author

akutz commented Mar 3, 2025

Hi @akutz I think this will address this cloud-init issue #5729, we are good when transport is guestinfo, metadata can be updated before HotPlug. But with imc transport, I need figure out how HotPlug work with live customization, usually a hardware change is before guest customization.

Do you think I should remove HOTPLUG if the transport is IMC then?

@PengpengSun I updated the PR so that HOTPLUG is only supported if the datasource is not using the IMC transport.

@akutz akutz force-pushed the feature/ds-vmw-reconfig-net branch 2 times, most recently from d35e4a3 to 9be437d Compare March 3, 2025 13:59
@PengpengSun
Contributor

Hi @akutz I think this will address this cloud-init issue #5729, we are good when transport is guestinfo, metadata can be updated before HotPlug. But with imc transport, I need figure out how HotPlug work with live customization, usually a hardware change is before guest customization.

Do you think I should remove HOTPLUG if the transport is IMC then?

@PengpengSun I updated the PR so that HOTPLUG is only supported if the datasource is not using the IMC transport.

@akutz Please keep HOTPLUG event support for IMC, even though IMC does not yet support live metadata updates (which would mean delegating the live customization cfg to cloud-init). That is another feature to be done; when it's ready, the HOTPLUG event will be supported by the IMC transport directly. For now, the guestinfo transport is in front of IMC and both transports require the VMware platform, so we are good.

@akutz akutz force-pushed the feature/ds-vmw-reconfig-net branch from 9be437d to 827a6ef Compare March 5, 2025 14:01
@akutz
Contributor Author

akutz commented Mar 5, 2025

Hi @akutz I think this will address this cloud-init issue #5729, we are good when transport is guestinfo, metadata can be updated before HotPlug. But with imc transport, I need figure out how HotPlug work with live customization, usually a hardware change is before guest customization.

Do you think I should remove HOTPLUG if the transport is IMC then?

@PengpengSun I updated the PR so that HOTPLUG is only supported if the datasource is not using the IMC transport.

@akutz Please keep HOTPLUG event support for IMC, although IMC hasn't supported live metadata update (which means delegating live customization cfg to cloud-init). Another feature to be done, when it's ready, HOTPLUG event will be supported by IMC transport directly. For now, guestinfo transport is in front of IMC and both transports requires VMware platform, so we are good.

Done.

@akutz akutz force-pushed the feature/ds-vmw-reconfig-net branch 3 times, most recently from 31c0cc7 to 5e58f1c Compare March 5, 2025 14:18
Contributor

@TheRealFalcon TheRealFalcon left a comment

Just to be clear, by default and without any additional user data, you want cloud-init to

  1. Re-fetch all IMDS metadata every single boot, and
  2. Re-fetch network metadata every time there is a network add or remove event, including virtual network interfaces such as those used by docker

I'm asking generally because it seems to be a known issue that metadata sources aren't always available on the VMware platform. Is the current code to determine whether to use the cached datasource meant to handle these additional use cases?

In the past, item 2 has caused a critical bug on EC2 that resulted in restricting hotplug to specific drivers. Is this also a valid concern for VMware?

Comment thread cloudinit/sources/DataSourceVMware.py Outdated
Comment thread cloudinit/sources/DataSourceVMware.py Outdated
Comment thread cloudinit/sources/DataSourceVMware.py Outdated
@akutz
Contributor Author

akutz commented Mar 5, 2025

Just to be clear, by default and without any additional user data, you want cloud-init to

  1. Re-fetch all IMDS metadata every single boot, and
  2. Re-fetch network metadata every time there is a network add or remove event, including virtual network interfaces such as those used by docker

Except for the bit about Docker (see below, I will address this the same way Ec2 did), yeah. This is for a few reasons. When we backup/restore VMs and/or replicate them to other sites, we want to have them update their network configurations. However, today the only way we can do that is by changing the instance ID, which is bad, mmmm kay? It causes things like runcmds to run again and SSH host keys to be regenerated. All bad stuff for the most part, at least from the user's perspective. Since we do most of this in a Kubernetes controller, we don't know if the VM was restored and don't know if it was replicated, so we want Cloud-Init to check the network config at boot to verify all is spiffy in who-ville.

Let's say we knew about these events (restore, replication), is the user data section for configuring allowed events read at each boot? Could that be used via vendordata to tell the next boot to reconfigure networking?

I'm asking generally because it seems to be a known issue that metadata sources aren't always available on the VMware platform. Is the current code to determine whether to use the cached datasource meant to handle these additional use cases?

The above was specific to the IMC transport. The metadata should always be available via the other transports as long as it is not redacted. I will add detection to determine if the network data is in the metadata with something like:

if "network" in self.metadata

And change the supported events to None if there is no network configuration present in the metadata.
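
The check described above could look roughly like the sketch below. It is only an illustration: `EventType` is a local stand-in for cloud-init's event enum rather than an import from `cloudinit`, and `supported_network_events` is a hypothetical helper name, not a function from the datasource.

```python
from enum import Enum


class EventType(Enum):
    """Local stand-in for cloud-init's event types (assumption)."""

    BOOT = "boot"
    BOOT_NEW_INSTANCE = "boot-new-instance"
    HOTPLUG = "hotplug"


def supported_network_events(metadata):
    """Return the network events to advertise for the given metadata.

    Hypothetical helper: an empty set means that no event may trigger a
    network reconfiguration because there is no network data to apply.
    """
    if "network" in metadata:
        return {
            EventType.BOOT,
            EventType.BOOT_NEW_INSTANCE,
            EventType.HOTPLUG,
        }
    return set()
```

With this shape, a VM whose metadata carries no `network` key never advertises the events, so a boot or hotplug cannot wipe a configuration that the datasource has nothing to replace it with.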

In the past, item 2 has caused a critical bug on EC2 that resulted in restricting hotplug to specific drivers. Is this also a valid concern for VMware?

Good call. I'll use the same defaults and make the list of drivers configurable via a metadata option. Thanks @TheRealFalcon!
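
A minimal sketch of what a driver-restricted hotplug rule with a configurable driver list might look like. Everything here is an assumption for illustration: the helper name `build_hotplug_udev_rule` is hypothetical, the `cloudinit_hook` and `cloudinit_end` labels are assumed to match cloud-init's hotplug udev rules, and the driver names in the usage note are examples rather than the defaults chosen in this PR.

```python
def build_hotplug_udev_rule(drivers):
    """Build a udev rule fragment that only lets the listed NIC drivers
    reach the hotplug hook; every other device skips it.

    Hypothetical helper; the label names are assumptions.
    """
    if not drivers:
        raise ValueError("at least one driver is required")
    pattern = "|".join(drivers)
    return (
        f'ENV{{ID_NET_DRIVER}}=="{pattern}", GOTO="cloudinit_hook"\n'
        'GOTO="cloudinit_end"\n'
    )
```

For example, `build_hotplug_udev_rule(["vmxnet3", "e1000e"])` yields a rule that matches only those two drivers, so virtual interfaces created by container runtimes would not trigger a metadata re-fetch.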

@TheRealFalcon TheRealFalcon self-assigned this Mar 6, 2025
@PengpengSun
Contributor

PengpengSun commented Mar 6, 2025

Just to be clear, by default and without any additional user data, you want cloud-init to

  1. Re-fetch all IMDS metadata every single boot, and

Hi @akutz and @TheRealFalcon ,

What if there is persistent metadata in guestinfo and the instance goes through a "second" boot: will it apply the metadata to the instance again, especially the network section? If so, it will overwrite any network configuration that the customer set manually between the "first" and "second" boot.

  2. Re-fetch network metadata every time there is a network add or remove event, including virtual network interfaces such as those used by docker

Except for the bit about Docker (see below, I will address this the same way Ec2 did), yeah. This is for a few reasons. When we backup/restore VMs and/or replicate them to other sites, we want to have them update their network configurations. However, today the only way we can do that is by changing the instance ID, which is bad, mmmm kay? It causes things like runcmds to run again and SSH host keys to SSH host key again. All bad stuff for the most part, at least from the user's perspective. Since we do most of this in a Kubernetes controller we don't know if the VM was restored and don't know if it was replicated, so we Cloud-Init to check the network config at boot to verify all is spiffy in who-ville.

Let's say we knew about these events (restore, replication), is the user data section for configuring allowed events read at each boot? Could that be used via vendordata to tell the next boot to reconfigure networking?

I'm asking generally because it seems to be a known issue that metadata sources aren't always available on the VMware platform. Is the current code to determine whether to used the cached datasource meant to handle these additional use cases?

The above was specific to the IMC transport. The metadata should always available via the other transports as long as it is not redacted. I will add detection to determine if the network data is in the metadata with something like:

Yes, for the IMC transport, we don't want the instance to re-fetch all IMDS metadata on every single boot. My understanding is that with the current code that determines whether to use the cached datasource, if the cached data_access_method == DATA_ACCESS_METHOD_IMC, cloud-init will fall back to the cache but re-fetch all IMDS metadata.

if "network" in self.metadata

And change the supported events to None if there is no network configuration present in the metadata.

In the past, item 2 has caused a critical bug on EC2 that resulted in restricting hotplug to specific drivers. Is this also a valid concern for VMware?

Good call. I'll use the same defaults and make the list of drivers configurable via a metadata option. Thanks @TheRealFalcon!

@akutz
Contributor Author

akutz commented Mar 6, 2025

What if there is persistent metadata in guestinfo and instance has the "second" boot, will it apply metadata again to this instance? especially network in metadata. If yes, it will overwrite network configuration which customer manually set between "first" and "second" boot.

@TheRealFalcon Is Cloud-Init smart enough to detect that the network configuration has not changed, even if we request the BOOT event? But even if Cloud-Init does configure it again, is that not the goal? If the network data has not changed, then even if it overwrites the network configuration, it is the same.

@akutz
Contributor Author

akutz commented Mar 6, 2025

Yes, for IMC transport, we don't want to instance re-fetch all IMDS metadata every single boot. My understanding is that with the current code to determine whether to used the cached datasource, if cached data_access_method == DATA_ACCESS_METHOD_IMC, cloud-init will fallback to cache but re-fetch all IMDS metadata.

@PengpengSun Let's sync Monday on the above. It sounds like you need me to update the PR to indicate the allowed events are None if IMC is used and there is no cached data?

@TheRealFalcon
Contributor

Is Cloud-Init smart enough to detect that the network configuration has not changed, even if we request the BOOT event? But even if Cloud-Init does configure it again, is that not the goal? If the network data has not changed, then even if it overwrites the network configuration, it is the same.

It will refetch the configuration every boot and depending on the config being rendered, it may check the diff between the two configs to see if it needs to write out a new config. But you're correct; when the network data hasn't changed, it doesn't matter if we re-write the same config because the config is the same.
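
The "diff before writing" idea mentioned above can be sketched as follows. This is a generic illustration, not cloud-init's renderer code; `write_if_changed` is a hypothetical helper name.

```python
def write_if_changed(path, rendered):
    """Write the rendered config to path only when it differs from the
    content already on disk.

    Returns True when the file was written, False when it was left alone.
    """
    try:
        with open(path, "r", encoding="utf-8") as f:
            if f.read() == rendered:
                return False
    except FileNotFoundError:
        # No previous config on disk, so fall through and write one.
        pass
    with open(path, "w", encoding="utf-8") as f:
        f.write(rendered)
    return True
```

When the network data is unchanged between boots, the second call returns False and nothing on disk is touched, which matches the point that re-applying identical data is harmless.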

@akutz
Contributor Author

akutz commented Mar 6, 2025

It will refetch the configuration every boot

Is this so bad though? I mean on your end that is. The cost on our end is what it is, and it's not network based, so an RPC call per boot is not the worst thing. But is it super expensive upstream of the data source? For example, does Cloud-Init need to transform the network v2 data we give back via the network renderer before it can do the diff you mentioned? If that is the case, then I can see how that adds to the cost.

Would there be value in our DS keeping its own receipt to track whether or not the network has changed?

The more I think about this, the more I wonder if we should set the SUPPORTED_EVENTS to everything but set the DEFAULT_EVENTS to just boot-new-instance and have VM Service use vendordata to add boot and hotplug. That way we can limit the scope of this to VMs booted via the VM Service platform rather than all VMs booted on vSphere.

Still, VM Service is the future of vSphere, so it may not matter in the long run, and I like that we don't have to do anything with vendordata since we don't use it yet for anything. It's always there in the back pocket. Yet, I've been hesitant to use it since the docs have always indicated that vendordata can be blocked at the guest level, and that no one should use it if something is required to occur.

@PengpengSun
Contributor

Yes, for IMC transport, we don't want to instance re-fetch all IMDS metadata every single boot. My understanding is that with the current code to determine whether to used the cached datasource, if cached data_access_method == DATA_ACCESS_METHOD_IMC, cloud-init will fallback to cache but re-fetch all IMDS metadata.

@PengpengSun Let's sync Monday on the above. It sounds like you need me to update the PR to indicate the allowed events are None if IMC is used and there is no cached data?

@akutz If there is no cache, both the IMC and guestinfo transports are OK to re-fetch metadata. The allowed events are at least BOOT_NEW_INSTANCE and, as we discussed, HOTPLUG is fine for both transports too. Yes, let's sync on this offline.

@TheRealFalcon
Contributor

Is this so bad though? I mean on your end that is. The cost on our end is what it is, and it's not network based, so an RPC call per boot is not the worst thing. But is it super expensive upstream of the data source?

I'm not really worried about the BOOT event cost-wise. It could add a second or two to boot, but otherwise there's not really a cost for cloud-init. Many other datasources have moved to doing this. I was more just double checking that the caching behavior has been considered and that there are no big blockers there. If things are ok from your end, then I'm ok with the change.

For example, does Cloud-Init need to transform the network v2 data we give back via the network renderer before it can do the diff you mentioned? If that is the case, then I can see how that adds to the cost.

Yes, but that's fairly trivial time-wise.

@akutz
Contributor Author

akutz commented Mar 12, 2025

Hi @TheRealFalcon,

I am really struggling to understand the difference between supported_update_events and default_update_events. I thought I understood it to be the following:

  • supported_update_events -- what is possible, but not necessarily enabled
  • default_update_events -- when merged with user data, what is actually enabled

However, if you look through the codebase, there are places where get_supported_events is checked that never reference default_update_events. For instance:

...boolean ands the result with update_metadata_if_supported which uses supported_update_events...

Never mind, I missed that dangling boolean and comparison 😄

Can you please explain the difference?
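
One way to read the two attributes, sketched under assumptions rather than taken from cloud-init's source: an event fires only if the datasource supports it and it is in the allowed set, where the allowed set is the user-configured events when present and the datasource defaults otherwise. `EventType` and `is_event_enabled` below are local stand-ins, and the "user config replaces defaults" rule is an assumption.

```python
from enum import Enum


class EventType(Enum):
    """Local stand-in for cloud-init's event types (assumption)."""

    BOOT = "boot"
    BOOT_NEW_INSTANCE = "boot-new-instance"
    HOTPLUG = "hotplug"


def is_event_enabled(event, supported, default, user=None):
    """Return True when event should trigger an update.

    supported: events the datasource is able to handle at all.
    default:   events enabled when the user configures nothing.
    user:      events requested in user data, or None if not configured
               (assumed here to replace the defaults for the scope).
    """
    allowed = default if user is None else user
    return event in supported and event in allowed
```

Under this reading, a user can switch on anything in the supported set, but asking for an unsupported event has no effect.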

@akutz akutz force-pushed the feature/ds-vmw-reconfig-net branch 3 times, most recently from a9216c8 to 5f72f90 Compare March 12, 2025 18:05
@akutz
Contributor Author

akutz commented Mar 12, 2025

@TheRealFalcon I made the requested changes. @PengpengSun and I also decided that by default we will not activate the boot event, leaving that up to VM Service to activate with vendor data. Thanks!
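
As a rough illustration of the vendor data approach, the sketch below builds a cloud-config document that sets `updates.network.when`, the key cloud-init reads to enable events. The helper name `build_network_update_vendor_data` is hypothetical, and emitting the body as JSON (which is also valid YAML) is a convenience of this sketch, not something VM Service is known to do.

```python
import json


def build_network_update_vendor_data(events):
    """Return a #cloud-config document enabling the given network events.

    Hypothetical helper; events are strings such as "boot" or "hotplug".
    """
    body = {"updates": {"network": {"when": list(events)}}}
    # JSON is a subset of YAML, so the body can follow the header as-is.
    return "#cloud-config\n" + json.dumps(body, indent=2) + "\n"
```

Calling `build_network_update_vendor_data(["boot", "hotplug"])` produces a payload that opts a VM in to network reconfiguration on every boot and on hotplug, while VMs without it keep the more conservative defaults.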

Comment thread cloudinit/sources/DataSourceVMware.py Outdated
Comment thread cloudinit/sources/DataSourceVMware.py Outdated
@akutz akutz force-pushed the feature/ds-vmw-reconfig-net branch 2 times, most recently from e7034c6 to 02ab924 Compare March 12, 2025 19:54
@akutz akutz requested a review from TheRealFalcon March 14, 2025 12:42
Contributor

@TheRealFalcon TheRealFalcon left a comment

Looking pretty good! I left a few mostly minor inline comments. Are you wanting another review from a VMware person before we wrap this one up?

Comment thread tests/unittests/sources/test_vmware.py
Comment thread tests/unittests/sources/test_vmware.py Outdated
Comment thread cloudinit/sources/DataSourceVMware.py Outdated
Comment thread cloudinit/sources/DataSourceVMware.py Outdated
Comment thread cloudinit/sources/DataSourceVMware.py Outdated
Comment thread cloudinit/sources/DataSourceVMware.py
Comment thread cloudinit/sources/DataSourceVMware.py
@akutz akutz force-pushed the feature/ds-vmw-reconfig-net branch 2 times, most recently from db7232f to 01727ae Compare March 18, 2025 14:14
@akutz akutz requested a review from TheRealFalcon March 18, 2025 14:14
@akutz akutz force-pushed the feature/ds-vmw-reconfig-net branch 2 times, most recently from c22cdc5 to bcb2b1e Compare March 18, 2025 14:32
@akutz
Contributor Author

akutz commented Mar 18, 2025

Looking pretty good! I left a few mostly minor inline comments. Are you wanting another review from a VMware person before we wrap this one up?

@PengpengSun is good with it, so once you are, please go ahead and merge. Thank you!

@akutz akutz force-pushed the feature/ds-vmw-reconfig-net branch from bcb2b1e to 787276c Compare March 18, 2025 14:44
Contributor

@TheRealFalcon TheRealFalcon left a comment

LGTM!

@TheRealFalcon TheRealFalcon merged commit 70c239b into canonical:main Mar 18, 2025
@xiachen-rh
Contributor

Hi @akutz and @PengpengSun, I have a question about the IMC transport. During an OS upgrade, cloud-init cleans its cache when a Python version change is detected (#857), so how can cloud-init re-fetch metadata? Based on my test with cloud-init 24.4, it falls back to DataSourceNone. Does this feature (#6063) support re-fetching metadata when using the IMC transport?

Yes, for IMC transport, we don't want to instance re-fetch all IMDS metadata every single boot. My understanding is that with the current code to determine whether to used the cached datasource, if cached data_access_method == DATA_ACCESS_METHOD_IMC, cloud-init will fallback to cache but re-fetch all IMDS metadata.

@PengpengSun Let's sync Monday on the above. It sounds like you need me to update the PR to indicate the allowed events are None if IMC is used and there is no cached data?

@akutz If there is no cache, both IMC and guestinfo transports are ok to re-fetch metadata, the allowed events is at least BOOT_NEW_INSTANCE and as we discussed HOTPLUG is fine for both transports too. Yes, let's sync this offline.

@PengpengSun
Contributor

Hi @xiachen-rh, this feature (#6063) does NOT support re-fetching metadata on HOTPLUG for the IMC transport. The reason is that the data from the IMC transport is not persistent, so re-fetching metadata will not load data from the IMC transport unless the VM is customized along with the HOTPLUG event.
"Fall back to cached local ds if no valid ds found" (#4997) depends on whether the cache is available and whether the cached data came from the IMC transport on the VMware platform. If the cache is cleaned because a Python version change is detected (#857), falling back to the cached IMC transport is impossible and cloud-init will fall back to DataSourceNone.
For a Red Hat VM, one workaround I can see is removing the disable_vmware_customization: false flag from /etc/cloud/cloud.cfg. I think this will disable cloud-init, since no possible DS will be found during ds-identify execution.
Another way is customizing the VM again after the OS upgrade to regenerate the cloud-init cache.
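
The fallback behaviour described in this comment can be summarised as a small decision function. This is a simplified sketch, not the datasource's code: `select_datasource` is a hypothetical name, the return values are labels invented for the example, and the string value given to `DATA_ACCESS_METHOD_IMC` is an assumption.

```python
DATA_ACCESS_METHOD_IMC = "imc"  # value assumed for illustration


def select_datasource(fresh_found, cache_available, cached_method):
    """Pick which datasource is used on this boot.

    fresh_found:     a transport returned data during this boot.
    cache_available: a cached datasource survived from an earlier boot.
    cached_method:   data access method recorded in that cache, or None.
    """
    if fresh_found:
        return "fresh"
    if cache_available and cached_method == DATA_ACCESS_METHOD_IMC:
        return "cached"
    # For example, the cache was cleaned after a Python version change.
    return "none"
```

The OS upgrade case from the question corresponds to `select_datasource(False, False, None)`, which returns `"none"`, standing in for the DataSourceNone outcome reported above.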

@xiachen-rh
Contributor

xiachen-rh commented Apr 9, 2025

For RedHat vm, a workaround I can see is removing disable_vmware_customization: false flag from /etc/cloud/cloud.cfg, I think this will disable cloud-init since no possible DS found during ds-identify execution.

Thanks @PengpengSun for your help; this setting is good to know. I tested it, and cloud-init is disabled after the OS upgrade reboot. I think it fits the upgrade scenario well, so we will document it. Thanks!

@ani-sinha
Contributor

For RedHat vm, a workaround I can see is removing disable_vmware_customization: false flag from /etc/cloud/cloud.cfg, I think this will disable cloud-init since no possible DS found during ds-identify execution.

Thanks @PengpengSun for your help, and good to know this setting, and I tested, cloud-init is disabled after OS upgrade reboot, I think it can fit the upgrade scenario well, we will document it, thanks!

But don't we want cloud-init enabled after OS upgrade? I am confused.

Labels

documentation This Pull Request changes documentation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[enhancement]: hotplug in VMWare datasource

6 participants