Decrease azure ephemeral resource disk wait time #800
@raharper @smoser @anhvoms The previous PR was closed due to inactivity while I was away for a short while. I have reopened it with the suggested changes. I dropped the wait entirely, as the platform guarantees the existence of the resource disk for VMs that come with one, and also updated the PR message above with new info.
smoser
left a comment
Given the unfortunate situation of not being able to know whether we should expect a given disk or not, I think this approach is the best we can do.
My only concern is that we'll only know of failures when a user reports a problem, and they're not even likely to immediately notice a problem.
I guess the way to address that would be in cloud-tests. The test would be able to know (based on the requested instance-type or details) if it expected a disk or not. It could launch one of each and expect to find the disk or expect to have no WARN/errors in the log about one not being there.
@OddBloke or @TheRealFalcon is something like that possible?
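A rough sketch of the check described above, in case it helps frame the test. Everything here is an assumption for illustration: the function name, the idea that the test passes in whether the requested instance type should have a resource disk, and the log-matching keywords (the real cloud-init log wording may differ).

```python
import re


def check_resource_disk_expectation(expects_disk, dev_links, log_text):
    """Return a list of problems; an empty list means the check passed.

    expects_disk: True if the requested instance type should have a disk.
    dev_links: text listing of /dev/disk/cloud (hypothetical input).
    log_text: contents of the cloud-init log (hypothetical input).
    """
    problems = []
    has_link = "azure_resource" in dev_links

    if expects_disk:
        if not has_link:
            problems.append("expected azure_resource symlink is missing")
        return problems

    if has_link:
        problems.append("unexpected azure_resource symlink is present")

    # Assumed keywords: flag WARN/ERROR lines that mention the resource disk.
    for line in log_text.splitlines():
        mentions_disk = "azure_resource" in line or "ephemeral" in line
        if re.search(r"WARN|ERROR", line) and mentions_disk:
            problems.append("log complains about the disk: " + line.strip())
    return problems
```

The test would launch one instance of each kind and assert that the returned list is empty in both cases.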
Yes, I think something like that shouldn't be too hard.
@johnsonshi I created an example integration test that could be used/adapted here (note that it also requires the latest pycloudlib):
776844d to 01df49c
assert 'azure_root -> ../../sda' in dev_links
assert 'azure_root-part1 -> ../../sda1' in dev_links
if is_ephemeral:
    assert 'azure_resource -> ../../sdb' in dev_links
    assert 'azure_resource-part1 -> ../../sdb1' in dev_links
I'm not certain, but I think the reason for these symlinks existing is that the underlying device names are not necessarily stable. I think this check is still good if you just s/ ->.*// each of the strings?
assert re.search('sda1.*part /', blks) is not None
if is_ephemeral:
    assert re.search('sdb1.*part /mnt', blks) is not None
Given the above, probably want to dereference the named symlinks for these disks and check for those?
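To make the two suggestions concrete, here is a minimal sketch. The helper names are hypothetical, and it assumes dev_links holds `ls -l` style output and blks holds `lsblk` style output, as in the test under review. It checks link names without their targets, resolves a symlink to whatever device it currently points at, and then looks for that device's mount point.

```python
import os
import re


def link_names(ls_output):
    """Return symlink names from ls -l style text, ignoring their targets."""
    names = set()
    for line in ls_output.splitlines():
        if " -> " in line:
            # Same effect as s/ ->.*// followed by taking the last field.
            names.add(re.sub(r" ->.*", "", line).split()[-1])
    return names


def resolved_device_name(link_path):
    """Dereference a symlink and return the kernel device name it points to."""
    return os.path.basename(os.path.realpath(link_path))


def is_mounted_at(blks, device, mountpoint):
    """True if lsblk style text shows device as a partition at mountpoint."""
    pattern = r"{}\b.*part\s+{}\s*$".format(
        re.escape(device), re.escape(mountpoint))
    return re.search(pattern, blks, re.MULTILINE) is not None
```

In the test this could be used by resolving /dev/disk/cloud/azure_resource-part1 on the instance and passing the resulting name to is_mounted_at(blks, name, '/mnt'), so nothing depends on the disk being sdb.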
OddBloke
left a comment
Integration test LGTM, and passes locally. Thanks!
cf3bb5f to 1d25e98
Proposed Commit Message
Azure: Support for VMs without ephemeral resource disks.
Changes:
- Merge the builtin Azure ephemeral disk configs during
  DataSourceAzure._get_data() only if the ephemeral disk
  exists.
- DataSourceAzure.address_ephemeral_resize() (which is
  invoked in DataSourceAzure.activate()) should only set up
  the ephemeral disk if the disk exists.
Azure VMs may or may not come with ephemeral resource disks
depending on the VM SKU. For VM SKUs that come with
ephemeral resource disks, the Azure platform guarantees that
the ephemeral resource disk is attached to the VM before
the VM is booted. For VM SKUs that do not come with
ephemeral resource disks, cloud-init currently attempts
to wait and set up a non-existent ephemeral resource
disk, which wastes boot time. It also causes disk setup
modules to fail (due to non-existent references to the
ephemeral resource disk).
udevadm settle is invoked by cloud-init very early in boot,
before DataSourceAzure's _get_data() and activate() methods.
Within DataSourceAzure's _get_data() and activate() methods,
the ephemeral resource disk path should exist if the
VM SKU comes with an ephemeral resource disk.
The ephemeral resource disk path should not exist if the
VM SKU does not come with an ephemeral resource disk.
LP: #1901011
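For illustration only, the first change could look roughly like the sketch below. The function name, the plain dict merge, and the injectable exists parameter are assumptions made to keep the example self-contained; this is not the actual DataSourceAzure code, which has its own config-merging helpers.

```python
import os

# Symlink path taken from the PR description.
RESOURCE_DISK_PATH = "/dev/disk/cloud/azure_resource"


def build_disk_config(base_cfg, ephemeral_cfg,
                      path=RESOURCE_DISK_PATH, exists=os.path.exists):
    """Merge the ephemeral disk config only when the disk is present.

    base_cfg and ephemeral_cfg are plain dicts here (hypothetical shape).
    exists is injectable so the logic can be exercised without a real disk.
    """
    merged = dict(base_cfg)
    if exists(path):
        merged.update(ephemeral_cfg)
    return merged
```

On a SKU without a resource disk the returned config carries no reference to the missing device, which is what keeps the later disk setup modules from failing.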
Additional Context
Problem
For Azure VMs, cloud-init's DataSourceAzure.py formats and addresses ephemeral disk resizing. It does this for all Azure VM SKUs. See code and code.
The code right now waits up to 120 seconds for the ephemeral disk to appear before either proceeding or giving up. It waits up to 120 secs for the symlink /dev/disk/cloud/azure_resource to appear.
For new Azure VM SKUs (such as Dv4, Dsv4, Ev4, Esv4) that do not come with ephemeral resource disks, cloud-init would wait up to 120 seconds before giving up. See LP: #1901011.
For these new Azure VM SKUs without ephemeral resource disks, the disk_setup module would also fail later in cloud-init because "builtin Azure ephemeral disk configs" are merged into DataSourceAzure metadata. These builtin configs reference the non-existent ephemeral disk, which causes the module to fail.
Why this approach was chosen
As of today, the Azure Instance Metadata Service (Azure IMDS) does not expose VM instance metadata indicating whether an ephemeral resource disk exists for the VM or not.
The Azure host also guarantees that the ephemeral resource disk is attached to the VM before it is booted during VM deployment.
Additionally, the ephemeral resource disk symlink (/dev/disk/cloud/azure_resource) that cloud-init waits for is actually created by a udev rule that comes with cloud-init. Additional relevant code, code, and code.
Because:
- the Azure host attaches the ephemeral resource disk before the VM is booted,
- the symlink is created by udev rules (created as soon as the kernel detects the disk and sends the event to udev),
- udevadm settle is invoked very early in boot by cloud-init (before DataSourceAzure runs),
it is guaranteed that the ephemeral resource disk symlink exists by the time DataSourceAzure runs.
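The reasoning above can be sketched as the difference between a polling wait and a single existence check. Both functions below are illustrative stand-ins with hypothetical names and parameters, not the code in DataSourceAzure.py. The exists and sleep arguments are injectable only so the behaviour can be demonstrated without a real device.

```python
import os
import time


def wait_for_path(path, maxwait=120, naplen=1,
                  exists=os.path.exists, sleep=time.sleep):
    """Poll for path until it appears or maxwait seconds have been slept."""
    waited = 0
    while waited < maxwait:
        if exists(path):
            return True
        sleep(naplen)
        waited += naplen
    return exists(path)


def resource_disk_present(path, exists=os.path.exists):
    """Single check; relies on udev having settled before this runs."""
    return exists(path)
```

On a SKU without a resource disk, the polling version sleeps for the full maxwait before returning False, while the single check returns False immediately.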
Test Steps
No regression for VM SKUs with ephemeral resource disk
Standard_DS1_V2 VM (has ephemeral resource disk) from this custom image.
Fix for VM SKUs without ephemeral resource disk
Standard_D2s_v4 VM (no ephemeral resource disk) from this custom image.
Checklist: