Closed
4 changes: 2 additions & 2 deletions cloudinit/sources/DataSourceAzure.py
@@ -1452,7 +1452,7 @@ def count_files(mp):


@azure_ds_telemetry_reporter
-def address_ephemeral_resize(devpath=RESOURCE_DISK_PATH, maxwait=120,
+def address_ephemeral_resize(devpath=RESOURCE_DISK_PATH, maxwait=5,
Collaborator

Is there no way to know whether a VM is one that does not have ephemeral disks?

You mention specific types:

"Dv4, Dsv4, Ev4, Esv4"

Is the instance type available in metadata? If it is, then one could look up the maxwait value based on instance type.

And if there is not a way to know ... can you fix the platform?

Contributor Author

@johnsonshi johnsonshi Dec 8, 2020

As of today, the Azure Instance Metadata Service (Azure IMDS) does not expose VM instance metadata indicating whether an ephemeral resource disk exists for the VM or not.

Unfortunately, IMDS support for exposing the presence or absence of an ephemeral resource disk won't be available for some time (not within the next few months). In the meantime, Linux VMs deployed without these disks suffer a 2-minute delay in cloud-init.

I opened this draft PR ahead of time to communicate our plans:

  1. Decrease the wait time for the ephemeral disk.
    ** Optional: If the ephemeral disk doesn't come up, delete the built-in Azure DS cloud-config that sets up the ephemeral disk. This prevents disk_setup and fs_setup from raising RuntimeErrors when they reference a non-existent ephemeral disk.
  2. Once IMDS exposes this information, we'll have a more graceful approach.
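
The optional step in item 1 could look roughly like the sketch below. It assumes the built-in cloud-config is a plain dict with a `disk_setup` mapping and an `fs_setup` list keyed by an `ephemeral0` alias; the helper name `drop_ephemeral_setup` and the example config are hypothetical and only illustrate the idea, not the actual change.

```python
import copy


def drop_ephemeral_setup(cloud_cfg, alias="ephemeral0"):
    """Return a copy of cloud_cfg with every reference to `alias` removed.

    Hypothetical helper: the config layout is assumed, not taken from
    the real datasource code.
    """
    cfg = copy.deepcopy(cloud_cfg)

    # Drop the partitioning entry for the ephemeral disk.
    disk_setup = cfg.get("disk_setup", {})
    disk_setup.pop(alias, None)
    if not disk_setup:
        cfg.pop("disk_setup", None)

    # Drop filesystem entries on the disk itself or on its partitions.
    def on_alias(fs):
        device = str(fs.get("device", ""))
        return device == alias or device.startswith(alias + ".")

    fs_setup = [fs for fs in cfg.get("fs_setup", []) if not on_alias(fs)]
    if fs_setup:
        cfg["fs_setup"] = fs_setup
    else:
        cfg.pop("fs_setup", None)
    return cfg


# Illustrative config only; the real built-in config may differ.
example_cfg = {
    "disk_setup": {"ephemeral0": {"table_type": "gpt", "layout": [100]}},
    "fs_setup": [{"filesystem": "ext4", "device": "ephemeral0.1"}],
}
cleaned = drop_ephemeral_setup(example_cfg)
```

Returning a copy rather than mutating in place keeps the original config available in case the disk shows up on a later boot.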

Collaborator

> As of today, the Azure Instance Metadata Service (Azure IMDS) does not expose VM instance metadata indicating whether an ephemeral resource disk exists for the VM or not.
>
> Unfortunately, IMDS support for exposing the presence or absence of an ephemeral resource disk won't be available for some time (not within the next few months). In the meantime, Linux VMs deployed without these disks suffer a 2-minute delay in cloud-init.

I was asking whether the instance type is available. IIUC, there are new instance types that come without the ephemeral disk, and the instance metadata includes [1]:

"vmSize": "Standard_A3"

Can't we set the timeout value low if the vmSize is one of the types without the disk?

[1] https://docs.microsoft.com/en-us/azure/virtual-machines/linux/instance-metadata-service
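
A minimal sketch of the vmSize lookup being asked about, for illustration only. The regular expression is an assumption about how the Dv4, Dsv4, Ev4 and Esv4 families appear in `vmSize` strings, and `maxwait_for_vm_size` is a hypothetical helper; nothing here is an authoritative list of diskless sizes.

```python
import re

DEFAULT_MAXWAIT = 120  # seconds, the value before this PR
DISKLESS_MAXWAIT = 0   # assumed value for sizes known to have no disk

# Assumption: sizes in the Dv4/Dsv4/Ev4/Esv4 families look like
# "Standard_D4_v4" or "Standard_E8s_v4". Sizes with a local disk,
# such as "Standard_D4ds_v4", are deliberately not matched.
_DISKLESS_SIZE_RE = re.compile(r"^Standard_[DE]\d+s?_v4$")


def maxwait_for_vm_size(vm_size):
    """Pick a wait time from the vmSize string reported by IMDS."""
    if vm_size and _DISKLESS_SIZE_RE.match(vm_size):
        return DISKLESS_MAXWAIT
    return DEFAULT_MAXWAIT
```

For example, `maxwait_for_vm_size("Standard_A3")` keeps the long wait, while `maxwait_for_vm_size("Standard_D4s_v4")` returns the short one. Any size the pattern does not recognise falls back to the existing behaviour.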

Contributor Author

We can't take a dependency on the list of VM Sizes since the list of VM Sizes without ephemeral disks will grow as time passes. This design also won't be resilient if Azure ever exposes an option (to users) to deploy VMs with/without ephemeral disks.

Collaborator

Shouldn't IMDS indicate whether one is attached or not in the metadata? I understand that's not directly under your control. But the platform is in the position where it knows whether one was attached or not and it should expose that to the instance such that cloud-init can Do The Right Thing(tm).

If we drop the timeout to something lower, there is a class of users with ephemeral disks who will get errors in the log because cloud-init did not wait long enough for the disk to show up. Keeping it as it is means new instance types without them hit this long timeout, but that does not regress other instance types. Including the vmSize check in cloud-init seems like a reasonable compromise, and the list can be updated in cloud-init.

Looking at the function address_ephemeral_resize; in the case where we don't wait long enough there are some paths that will break, for example this:

https://bugs.launchpad.net/ubuntu/+source/cloud-init/+bug/1611074

Not waiting would mean cloud-init would fail to resize/reformat the ephemeral disk.

Would it be reasonable to pre-populate those instances with user-data (ideally vendor-data) via the UI or a CLI template, indicating that no ephemeral disk is attached? This could also be useful on instances which do have an ephemeral disk but whose users don't want it configured during first boot (saving the boot time spent on partitioning and formatting).

#cloud-config 
datasource:
   Azure:
      ephemeral_disk:
         enabled: true|false

user-data is processed before cloudinit/cmd/main.py calls the ds.activate() method, which is what triggers the ephemeral resize. DataSourceAzure.activate() can then skip the call to address_ephemeral_resize() altogether.
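
As a rough sketch of that idea: the `ephemeral_disk.enabled` key is the one proposed above and does not exist today, and both functions below are simplified, hypothetical stand-ins rather than the real `DataSourceAzure.activate()`.

```python
def ephemeral_disk_enabled(ds_cfg):
    """Read the proposed ephemeral_disk.enabled flag, defaulting to True."""
    ephemeral_cfg = (ds_cfg or {}).get("ephemeral_disk") or {}
    return bool(ephemeral_cfg.get("enabled", True))


def activate(ds_cfg, resize_func):
    """Call resize_func unless the config opts out of the ephemeral disk.

    Returns True when the resize step ran, False when it was skipped.
    """
    if not ephemeral_disk_enabled(ds_cfg):
        return False
    resize_func()
    return True


# Usage: with the flag off, the resize step is never invoked.
calls = []
skipped = activate({"ephemeral_disk": {"enabled": False}},
                   lambda: calls.append("resize"))
```

Defaulting to True means instances that carry no such user-data or vendor-data keep the current behaviour.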

Contributor Author

@johnsonshi johnsonshi Dec 11, 2020

@raharper / @anhvoms I definitely agree with you that IMDS/the platform should expose whether an ephemeral resource disk is attached. Unfortunately, this platform-side change can't be made and rolled out until more than 6 months from now. This means that for more than half a year, until the platform changes are made, Linux VMs deployed on Azure without an ephemeral disk will boot 2 minutes slower.

There are reasons why decreasing the wait time for ephemeral disks won't regress current existing VM SKUs:

  1. For VM SKUs with ephemeral disks, the Azure platform guarantees that the ephemeral disk is attached before a VM is booted.

  2. In the past few years, we've never seen an instance where cloud-init had to wait for ephemeral disks to come up, as the disk was already attached (guaranteed by the platform) and the disk symlink was already created (created by udev rules).

  3. As mentioned in my PR description above, I've performed deployment tests across a variety of Linux images on Azure. The intent was to test whether any distros or images had delays in creating the disk symlink.

grep "Azure ephemeral disk: All files appeared after" /var/log/cloud-init.log
util.py[DEBUG]: Azure ephemeral disk: All files appeared after 0 seconds: ['/dev/disk/cloud/azure_resource']
The statement above is true and tested for the following images:
** RedHat:RHEL:8.2:latest
** SUSE:sles-15-sp2:gen2:latest
** Canonical:0001-com-ubuntu-server-focal:20_04-lts:latest
** Canonical:0001-com-ubuntu-server-focal:20_04-lts-gen2:latest
** Canonical:UbuntuServer:18.04-LTS:latest
** Canonical:UbuntuServer:18_04-lts-gen2:latest
** RedHat:RHEL:7-LVM:7.9.2020111202
** RedHat:RHEL:7lvm-gen2:7.9.2020111205
** RedHat:RHEL:7.8:7.8.2020111309
** RedHat:RHEL:79-gen2:7.9.2020111302
** RedHat:RHEL:7_9:7.9.2020111301
** RedHat:RHEL:8-LVM:8.3.2020111909
** RedHat:RHEL:8-lvm-gen2:8.3.2020111910

  4. Because (1) the platform guarantees that the disk is attached before booting and (2) udev rules are loaded very early in boot (from my understanding, as soon as systemd-udevd starts, which is well before systemd starts cloud-init), there is almost no chance that this change regresses existing VMs.
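
The log check from item 3 can be scripted when testing many images. This is a small sketch that only assumes the log line format shown in the grep output above; the function name is hypothetical.

```python
import re

# Matches the debug line quoted above and captures the wait in seconds.
_APPEARED_RE = re.compile(
    r"Azure ephemeral disk: All files appeared after (\d+) seconds"
)


def ephemeral_wait_seconds(log_text):
    """Return every reported ephemeral-disk wait time found in log_text."""
    return [int(m.group(1)) for m in _APPEARED_RE.finditer(log_text)]


sample = (
    "util.py[DEBUG]: Azure ephemeral disk: All files appeared "
    "after 0 seconds: ['/dev/disk/cloud/azure_resource']\n"
)
waits = ephemeral_wait_seconds(sample)
```

Any non-zero value in the result would point to an image where the disk symlink was not ready when cloud-init first looked for it.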

Ultimately, this change (1) prevents a 2-minute delay for all Linux VMs without ephemeral disks on Azure (a significant performance penalty for a wide range of users) and (2) carries very little regression risk, since the failure it could expose has not been observed so far and is unlikely to ever occur (few or no users affected).

Collaborator

> 1. For VM SKUs with ephemeral disks, the Azure platform guarantees that the ephemeral disk is attached before a VM is booted.

I apologize for not reading your preamble in this PR more closely. If this is what the platform guarantees, then it seems reasonable to remove the timeout completely. If the disk is not there, waiting 5 seconds isn't going to make it show up (something more invasive, like running a blkid command to probe the storage layer, would likely be needed).

Contributor Author

No worries and thanks @raharper! This info definitely needs to be exposed through IMDS (as a 5-second wait is still a significant penalty especially for mass VM scale outs). After platform support is there (in a few months), we can programmatically decide whether to even wait for a disk or not.

is_new_instance=False, preserve_ntfs=False):
# wait for ephemeral disk to come up
naplen = .2
@@ -1470,7 +1470,7 @@ def address_ephemeral_resize(devpath=RESOURCE_DISK_PATH, maxwait=120,
report_diagnostic_event(
"ephemeral device '%s' did not appear after %d seconds." %
(devpath, maxwait),
-logger_func=LOG.warning)
+logger_func=LOG.debug)
return

result = False
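
For readers unfamiliar with how `maxwait` and `naplen` interact, here is a simplified, self-contained sketch of a polling wait. It is not cloud-init's own helper; `wait_for_path` and its injectable `exists` and `sleep` parameters are hypothetical and exist only to make the behaviour easy to test.

```python
import os
import time


def wait_for_path(path, maxwait, naplen=0.2,
                  exists=os.path.exists, sleep=time.sleep):
    """Poll for path every naplen seconds, for roughly maxwait seconds.

    Returns True as soon as the path exists, False once the time
    budget is used up. The path is always checked at least once.
    """
    waited = 0.0
    while True:
        if exists(path):
            return True
        if waited >= maxwait:
            return False
        sleep(naplen)
        waited += naplen
```

With `maxwait=0` the function checks once and returns immediately, which matches the suggestion in the thread that no waiting is needed if the platform attaches the disk before boot.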