Describe the bug
Hello,
thanks to @lethargosapatheia and issue canonical/cloud-init#4188 in cloud-init repository I was able to find the root cause of the following issue.
Since the Ubuntu 22.04 cloud image version 20230602, the behavior of cloud-init has unexpectedly changed. Up to that version, the virtual machine would start, run the cloud-init init-local stage, reboot, and then run the remaining cloud-init stages correctly. With this version, cloud-init starts additional stages besides 'init-local' during the first boot (see the attached cloud-init analyze show output). During these stages it is terminated prematurely by deployPkg. I found out that this is due to cd995a5, which changed the deployPkg plugin to wait for cloud-init execution to finish. If cloud-init has not finished within the default timeout of 30 s, the reboot is triggered anyway and cloud-init is killed.
This behavior breaks automatic provisioning, which relies on cloud-init settings being applied correctly. In my case, kubermatic/machine-controller starts provisioning through userdata and is interrupted by the reboot, which makes it impossible to automatically provision new Kubernetes nodes.
I can confirm the provisioning works properly with version 20230518 of Ubuntu 22.04, where cloud-init is executed correctly, without being terminated prematurely.
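To make the wait-then-reboot behavior described above easier to picture, here is a rough Python model of it. This is only a sketch and not the actual deployPkg code: the function name, the return values, and the 5 s polling interval are assumptions for illustration (the interval is guessed from the spacing of the last log entries), and treating a timeout of 0 as "skip the wait" is inferred from the workaround mentioned under "Expected behavior".

```python
import time
from typing import Callable


def wait_for_cloud_init(
    get_status: Callable[[], str],
    timeout_s: float = 30.0,  # default timeout reported in this issue
    poll_s: float = 5.0,      # assumed; the last two log entries are 5 s apart
    sleep: Callable[[float], None] = time.sleep,
    now: Callable[[], float] = time.monotonic,
) -> str:
    """Hypothetical model of the wait that precedes the customization reboot."""
    if timeout_s <= 0:
        return "skipped"       # assumed: 0 disables the wait (old behavior)
    deadline = now() + timeout_s
    while now() < deadline:
        if get_status() != "running":
            return "finished"  # cloud-init completed, so the reboot is safe
        sleep(poll_s)
    return "timed_out"         # reboot proceeds and interrupts cloud-init


# Simulated run with a fake clock: userdata that never finishes in time.
elapsed = [0.0]
outcome = wait_for_cloud_init(
    get_status=lambda: "running",
    sleep=lambda s: elapsed.__setitem__(0, elapsed[0] + s),
    now=lambda: elapsed[0],
)
print(outcome, elapsed[0])
```

The point of the model: with the wait in place, cloud-init gets up to 30 s to move past init-local and start the userdata, and anything still running at the deadline is cut off by the reboot.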
Environment details
Cloud-init versions are identical on both cloud image versions:
ii cloud-guest-utils 0.32-22-g45fe84a5-0ubuntu1 all cloud guest utilities
ii cloud-init 23.1.2-0ubuntu0~22.04.1 all initialization and customization tool for cloud instances
ii cloud-initramfs-copymods 0.47ubuntu1 all copy initramfs modules into root filesystem for later use
ii cloud-initramfs-dyn-netconf 0.47ubuntu1 all write a network interface file in /run for BOOTIF
open-vm-tools
Working: VMware Tools Version: 12.1.0.37487 (build-20219665)
Broken: VMware Tools Version: 12.1.5.39265 (build-20735119)
Operating System Distribution: Ubuntu 22.04.2 LTS
Cloud provider, platform or installer type: VMware Cloud Director/OVA
Logs
I'm uploading the relevant logs for both images.
broken.tar.gz
working.tar.gz
Best regards!
Reproduction steps
Working
- Deploy a VM using the Ubuntu 22.04 20230518 cloud image in a VMware environment, providing userdata that takes longer than 30 s to execute, for example installing multiple packages and configuring the VM for Kubernetes. I attached an example userdata file.
- Verify that cloud-init executed only the init-local stage on first boot:
cloud-init analyze show
-- Boot Record 01 --
The total time elapsed since completing an event is printed after the "@" character.
The time the event takes is printed after the "+" character.
Starting stage: init-local
|`->no cache found @00.00300s +00.00000s
|`->found local data from DataSourceOVF @00.01200s +00.00700s
Finished stage: (init-local) 00.25800 seconds
Total Time: 0.25800 seconds
-- Boot Record 02 --
The total time elapsed since completing an event is printed after the "@" character.
The time the event takes is printed after the "+" character.
Starting stage: init-local
|`->cache invalid in datasource: DataSourceOVF [seed=com.vmware.guestInfo] @00.00400s +00.00900s
|`->found local data from DataSourceOVF @00.01400s +00.00700s
Finished stage: (init-local) 00.06100 seconds
Starting stage: init-network
...
- Verify /var/log/vmware-imc/toolsDeployPkg.log does not contain the following lines:
[2023-08-25T08:18:48.556Z] [ info] Do not trigger reboot if cloud-init is executing.
[2023-08-25T08:18:48.857Z] [ info] Cloud-init status is 'running'.
...
[2023-08-25T08:19:16.029Z] [ info] Cloud-init status is 'running'.
[2023-08-25T08:19:21.029Z] [ info] Timed out waiting for cloud-init execution done.
Broken
- Deploy a VM using the Ubuntu 22.04 20230602 cloud image in a VMware environment, providing userdata that takes longer than 30 s to execute, for example installing multiple packages and configuring the VM for Kubernetes. I attached an example userdata file.
- Verify that cloud-init tried to execute multiple stages besides the init-local stage on first boot:
-- Boot Record 01 --
The total time elapsed since completing an event is printed after the "@" character.
The time the event takes is printed after the "+" character.
Starting stage: init-local
|`->no cache found @00.00500s +00.00000s
|`->found local data from DataSourceOVF @00.12200s +00.00800s
Finished stage: (init-local) 00.42600 seconds
Starting stage: init-network
...
Finished stage: (init-network) 01.55400 seconds
Starting stage: modules-config
...
Total Time: 14.33200 seconds
-- Boot Record 02 --
The total time elapsed since completing an event is printed after the "@" character.
The time the event takes is printed after the "+" character.
Starting stage: init-local
|`->cache invalid in datasource: DataSourceOVF [seed=com.vmware.guestInfo] @00.00400s +00.01000s
|`->found local data from DataSourceOVF @00.01500s +00.00700s
Finished stage: (init-local) 00.06700 seconds
Starting stage: init-network
...
- Check /var/log/vmware-imc/toolsDeployPkg.log for the following lines:
[2023-08-25T08:18:48.556Z] [ info] Do not trigger reboot if cloud-init is executing.
[2023-08-25T08:18:48.857Z] [ info] Cloud-init status is 'running'.
...
[2023-08-25T08:19:16.029Z] [ info] Cloud-init status is 'running'.
[2023-08-25T08:19:21.029Z] [ info] Timed out waiting for cloud-init execution done.
[2023-08-25T08:19:21.130Z] [ info] Trigger reboot.
[2023-08-25T08:19:22.049Z] [ info] Rebooting.
[2023-08-25T08:19:23.150Z] [ info] Reboot has been triggered.
- Check journalctl for premature termination of the cloud-init services:
Aug 25 08:19:21 broken systemd[1]: cloud-final.service: Main process exited, code=exited, status=1/FAILURE
Aug 25 08:19:21 broken systemd[1]: cloud-final.service: Failed with result 'exit-code'.
Aug 25 08:19:21 broken systemd[1]: Stopped Execute cloud user/final scripts.
-- Boot 1673d9c3f266400881bfd1b73933b494 --
Aug 25 08:19:54 broken systemd[1]: Starting Execute cloud user/final scripts...
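The log check from the steps above can also be scripted. The following is a minimal sketch: the helper name `was_interrupted` is made up for illustration, and the sample text is copied from the log excerpt in this issue rather than read from a live VM.

```python
TIMEOUT_MSG = "Timed out waiting for cloud-init execution done."
REBOOT_MSG = "Trigger reboot."


def was_interrupted(log_text: str) -> bool:
    """Return True if the wait timed out and a reboot was triggered afterwards."""
    lines = log_text.splitlines()
    for i, line in enumerate(lines):
        if TIMEOUT_MSG in line:
            # Only count a reboot that is logged after the timeout message.
            return any(REBOOT_MSG in later for later in lines[i + 1:])
    return False


# Excerpt from toolsDeployPkg.log as quoted in this issue.
SAMPLE = """\
[2023-08-25T08:18:48.556Z] [ info] Do not trigger reboot if cloud-init is executing.
[2023-08-25T08:18:48.857Z] [ info] Cloud-init status is 'running'.
[2023-08-25T08:19:16.029Z] [ info] Cloud-init status is 'running'.
[2023-08-25T08:19:21.029Z] [ info] Timed out waiting for cloud-init execution done.
[2023-08-25T08:19:21.130Z] [ info] Trigger reboot.
"""

print(was_interrupted(SAMPLE))
```

On an affected VM the same function could be fed the contents of /var/log/vmware-imc/toolsDeployPkg.log instead of the sample string.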
Expected behavior
No breaking change
I understand that this change was needed to resolve issues where users want to set a VM's networking and apply cloud-init userdata together before the VM is booted. Still, I find it bad practice to change the previous default and break previously working configurations for others. Previously I was able to use the cloud image without modification. Now I have to build my own image with wait-cloudinit-timeout=0 just to restore the previous behavior.
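For anyone who needs the same workaround, here is a sketch of how the setting could be baked into an image with a small Python helper. Only the key wait-cloudinit-timeout=0 comes from this issue. The `[deployPkg]` section name and the /etc/vmware-tools/tools.conf path are assumptions on my side, and the helper name is hypothetical, so please check them against your open-vm-tools version.

```python
import configparser


def disable_cloud_init_wait(conf_path: str) -> None:
    """Set wait-cloudinit-timeout=0 in the (assumed) [deployPkg] section."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str   # keep the original case of existing keys
    parser.read(conf_path)     # a missing file is silently treated as empty
    if not parser.has_section("deployPkg"):
        parser.add_section("deployPkg")
    parser.set("deployPkg", "wait-cloudinit-timeout", "0")
    with open(conf_path, "w") as fh:
        # Write "key=value" without spaces around the delimiter.
        parser.write(fh, space_around_delimiters=False)
```

During an image build this would be called as `disable_cloud_init_wait("/etc/vmware-tools/tools.conf")` (path assumed, see above). Note that configparser drops comments when it rewrites a file, so any commented-out examples in an existing tools.conf would be lost.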
Additional context
No response