Skip to content

Decrease azure ephemeral resource disk wait time#800

Merged
TheRealFalcon merged 9 commits into
canonical:masterfrom
johnsonshi:decrease-azure-ephemeral-resource-disk-wait-time
Feb 22, 2021
Merged

Decrease azure ephemeral resource disk wait time#800
TheRealFalcon merged 9 commits into
canonical:masterfrom
johnsonshi:decrease-azure-ephemeral-resource-disk-wait-time

Conversation

@johnsonshi
Copy link
Copy Markdown
Contributor

Proposed Commit Message

Azure: Support for VMs without ephemeral resource disks.

Changes:

  • Only merge in default Azure cloud ephemeral disk configs
    during DataSourceAzure._get_data() if the ephemeral disk
    exists.
  • DataSourceAzure.address_ephemeral_resize() (which is
    invoked in DataSourceAzure.activate() should only set up
    the ephemeral disk if the disk exists.

Azure VMs may or may not come with ephemeral resource disks
depending on the VM SKU. For VM SKUs that come with
ephemeral resource disks, the Azure platform guarantees that
the ephemeral resource disk is attached to the VM before
the VM is booted. For VM SKUs that do not come with
ephemeral resource disks, cloud-init currently attempts
to wait and set up a non-existent ephemeral resource
disk, which wastes boot time. It also causes disk setup
modules to fail (due to non-existent references to the
ephemeral resource disk).

udevadm settle is invoked by cloud-init very early in boot.
udevadm settle is invoked very early, before
DataSourceAzure's _get_data() and activate() methods.

Within DataSourceAzure's _get_data() and activate() methods,
the ephemeral resource disk path should exist if the
VM SKU comes with an ephemeral resource disk.
The ephemeral resource disk path should not exist if the
VM SKU does not come with an ephemeral resource disk.

LP: #1901011

Additional Context

Problem

For Azure VMs, cloud-init's DataSourceAzure.py formats and addresses ephemeral disk resizing. It does this for all Azure VM SKUs. See code and code.

The code right now waits up to 120 seconds for the ephemeral disk to appear before either proceeding or giving up. It waits up to 120 secs for the symlink /dev/disk/cloud/azure_resource to appear.

For new Azure VM SKUs (such as Dv4, Dsv4, Ev4, Esv4) that do not come with ephemeral resource disks, cloud-init would wait up to 120 seconds before giving up. See LP: #1901011.

For these new Azure VM SKUs without ephemeral resource disks, the disk_setup module would also fail later in cloud-init because "builtin Azure ephemeral disk configs" are merged into DataSourceAzure metadata. These builtin configs reference the non-existent ephemeral disk, which causes the module to fail.

Why this approach was chosen

As of today, the Azure Instance Metadata Service (Azure IMDS) does not expose VM instance metadata indicating whether an ephemeral resource disk exists for the VM or not.

The Azure host also guarantees that the ephemeral resource disk is attached to the VM before it is booted during VM deployment.

Additionally, the ephemeral resource disk symlink (/dev/disk/cloud/azure_resource) that cloud-init waits for is actually created by a udev rule that comes with cloud-init. Additional relevant code, code, and code.

Because:

  • the Azure host guarantees that resource disks are attached for VM SKUs that have resource disks,
  • because the symlinks are created by udev rules (created as soon as the kernel detects the disk and sends the event to udev),
  • and because udevadm settle is invoked very early in boot by cloud-init (before DataSourceAzure runs),
    it is guaranteed that the ephemeral resource disk symlink exists by the time DataSourceAzure runs.

Test Steps

No regression for VM SKUs with ephemeral resource disk

  • Created a custom image with this branch's cloud-init installed.
  • Deployed a Standard_DS1_V2 VM (has ephemeral resource disk) from this custom image.
  • Cloud-init logs here: https://paste.ubuntu.com/p/qdxkfwYKRD/
2021-02-03 23:23:31,913 - __init__.py[DEBUG]: Found unstable nic names: ['eth0']; calling udevadm settle
2021-02-03 23:23:31,914 - subp.py[DEBUG]: Running command ['udevadm', 'settle'] with allowed return codes [0] (shell=False, capture=True)
2021-02-03 23:23:31,933 - util.py[DEBUG]: Waiting for udev events to settle took 0.019 seconds
...
2021-02-03 23:23:32,823 - handlers.py[DEBUG]: finish: azure-ds/crawl_metadata: SUCCESS: crawl_metadata
2021-02-03 23:23:32,823 - util.py[DEBUG]: Crawl of metadata service took 1.117 seconds
...
2021-02-03 23:23:32,824 - azure.py[DEBUG]: Ephemeral resource disk '/dev/disk/cloud/azure_resource' exists. Merging default Azure cloud ephemeral disk configs.
...
2021-02-03 23:23:38,563 - handlers.py[DEBUG]: start: azure-ds/activate: activate
2021-02-03 23:23:38,563 - handlers.py[DEBUG]: start: azure-ds/address_ephemeral_resize: address_ephemeral_resize
2021-02-03 23:23:38,564 - azure.py[DEBUG]: Ephemeral resource disk '/dev/disk/cloud/azure_resource' exists.
...
2021-02-03 23:23:38,894 - handlers.py[DEBUG]: finish: azure-ds/address_ephemeral_resize: SUCCESS: address_ephemeral_resize

Fix for VM SKUs without ephemeral resource disk

  • Created a custom image with this branch's cloud-init installed.
  • Deployed a Standard_D2s_v4 VM (no ephemeral resource disk) from this custom image.
  • Cloud-init logs here: https://paste.ubuntu.com/p/GmCqBNTQrG/
2021-02-03 23:20:53,217 - __init__.py[DEBUG]: Found unstable nic names: ['eth0']; calling udevadm settle
2021-02-03 23:20:53,217 - subp.py[DEBUG]: Running command ['udevadm', 'settle'] with allowed return codes [0] (shell=False, capture=True)
2021-02-03 23:20:53,235 - util.py[DEBUG]: Waiting for udev events to settle took 0.018 seconds
...
2021-02-03 23:20:54,388 - handlers.py[DEBUG]: finish: azure-ds/crawl_metadata: SUCCESS: crawl_metadata
2021-02-03 23:20:54,389 - util.py[DEBUG]: Crawl of metadata service took 1.363 seconds
...
2021-02-03 23:20:54,389 - azure.py[DEBUG]: Ephemeral resource disk '/dev/disk/cloud/azure_resource' does not exist. Not merging default Azure cloud ephemeral disk configs.
...
2021-02-03 23:20:59,846 - handlers.py[DEBUG]: start: azure-ds/activate: activate
2021-02-03 23:20:59,846 - handlers.py[DEBUG]: start: azure-ds/address_ephemeral_resize: address_ephemeral_resize
2021-02-03 23:20:59,847 - azure.py[DEBUG]: Ephemeral resource disk '/dev/disk/cloud/azure_resource' does not exist.
2021-02-03 23:20:59,847 - handlers.py[DEBUG]: finish: azure-ds/address_ephemeral_resize: SUCCESS: address_ephemeral_resize

Checklist:

  • My code follows the process laid out in the documentation
  • I have updated or added any unit tests accordingly
  • I have updated or added any documentation accordingly

@johnsonshi
Copy link
Copy Markdown
Contributor Author

@raharper @smoser @anhvoms The previous PR was closed due to inactivity as I was away for a short while. I have reopened it with the suggested changes. Dropped the wait entirely as the platform guarantees the existence of the resource disk for VMs that come with them, and udevadm settle is invoked anyways by cloud-init early in boot.

Also updated the PR message above with new info.

Copy link
Copy Markdown
Collaborator

@smoser smoser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the unfortunate situation of not being able to know if we should expect a given disk or not, I think this is this approach is the best we can do.

My only concern is that we'll only know of failures when a user reports a problem, and they're not even likely to immediately notice a problem.

I guess the way to address that would be in cloud-tests. The test would be able to know (based on the requested instance-type or details) if it expected a disk or not. It could launch one of each and expect to find the disk or expect to have no WARN/errors in the log about one not being there.

@OddBloke or @TheRealFalcon is something like that possible?

@TheRealFalcon
Copy link
Copy Markdown
Contributor

Yes, I think something like that shouldn't be too hard.

@TheRealFalcon
Copy link
Copy Markdown
Contributor

@johnsonshi I created an example integration test that could be used/adapted here (note that is also requires the latest pycloudlib):

https://github.com/TheRealFalcon/cloud-init/blob/ephemeral-test/tests/integration_tests/bugs/test_lp1901011.py

@TheRealFalcon TheRealFalcon force-pushed the decrease-azure-ephemeral-resource-disk-wait-time branch from 776844d to 01df49c Compare February 22, 2021 19:31
Comment on lines +42 to +46
assert 'azure_root -> ../../sda' in dev_links
assert 'azure_root-part1 -> ../../sda1' in dev_links
if is_ephemeral:
assert 'azure_resource -> ../../sdb' in dev_links
assert 'azure_resource-part1 -> ../../sdb1' in dev_links
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not certain, but I think the reason for these symlinks existing is that the underlying device names are not necessarily stable. I think this check is still good if you just s/ ->.*// each of the strings?

Comment on lines +50 to +52
assert re.search('sda1.*part /', blks) is not None
if is_ephemeral:
assert re.search('sdb1.*part /mnt', blks) is not None
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the above, probably want to dereference the named symlinks for these disks and check for those?

Copy link
Copy Markdown
Collaborator

@OddBloke OddBloke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Integration test LGTM, and passes locally. Thanks!

@TheRealFalcon TheRealFalcon force-pushed the decrease-azure-ephemeral-resource-disk-wait-time branch from cf3bb5f to 1d25e98 Compare February 22, 2021 21:41
@TheRealFalcon TheRealFalcon merged commit a64b738 into canonical:master Feb 22, 2021
@johnsonshi johnsonshi deleted the decrease-azure-ephemeral-resource-disk-wait-time branch February 23, 2021 19:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants