Azure: Support for VMs without ephemeral resource disks #716
Closed: johnsonshi wants to merge 3 commits into canonical:master from johnsonshi:decrease-azure-ephemeral-resource-disk-wait-time
Is there no way to know whether a VM is one that does not have ephemeral disks?
You mention specific types:
"Dv4, Dsv4, Ev4, Esv4"
Is the instance type available in metadata? If it is, then one could look up the maxwait value based on instance type.
And if there is not a way to know ... can you fix the platform?
Unfortunately, IMDS support for exposing ephemeral resource disk presence/absence won't be available for at least the next few months. In the meantime, Linux VMs deployed without the disks suffer a 2-minute delay in cloud-init.
I opened this draft PR ahead of time to communicate our plans:
** Optional: If the ephemeral disk doesn't come up, then delete the built-in Azure DS cloud-config that references setting up the ephemeral disk. This prevents disk_setup and fs_setup from throwing RuntimeErrors due to referencing non-existent ephemeral disks.
I was asking if the instance type is available; IIUC, there are new instance types which are without the ephemeral disk.
The instance metadata has[1]:
"vmSize": "Standard_A3"
Can't we set the timeout value low if the vmSize is of the types without the disk?
[1] https://docs.microsoft.com/en-us/azure/virtual-machines/linux/instance-metadata-service
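The vmSize-based lookup suggested here could look roughly like the sketch below. It is only an illustration, not cloud-init code: the function name, the constants, and the regular expression are hypothetical, and the pattern assumes that the Dv4, Dsv4, Ev4 and Esv4 series mentioned earlier appear in metadata as strings such as "Standard_D2_v4" or "Standard_E4s_v4".

```python
import re

# Assumption: the diskless series named in this thread (Dv4, Dsv4, Ev4, Esv4)
# show up in IMDS as vmSize strings like "Standard_D2_v4" or "Standard_E4s_v4".
DISKLESS_SIZE_RE = re.compile(r"^Standard_[DE]\d+s?_v4$", re.IGNORECASE)

DEFAULT_MAXWAIT = 120  # seconds; the 2-minute wait discussed in this thread
DISKLESS_MAXWAIT = 0   # hypothetical value for sizes known to lack the disk


def maxwait_for_vm_size(vm_size):
    """Return how long to wait for the ephemeral disk for a given vmSize."""
    if vm_size and DISKLESS_SIZE_RE.match(vm_size):
        return DISKLESS_MAXWAIT
    # Unknown or missing sizes keep the existing, conservative wait.
    return DEFAULT_MAXWAIT


print(maxwait_for_vm_size("Standard_A3"))
print(maxwait_for_vm_size("Standard_D2s_v4"))
```

Any such table has to be maintained as new sizes are released, which is the trade-off debated in the rest of this thread.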
We can't take a dependency on the list of VM Sizes since the list of VM Sizes without ephemeral disks will grow as time passes. This design also won't be resilient if Azure ever exposes an option (to users) to deploy VMs with/without ephemeral disks.
Shouldn't IMDS indicate whether one is attached or not in the metadata? I understand that's not directly under your control. But the platform is in the position where it knows whether one was attached or not and it should expose that to the instance such that cloud-init can Do The Right Thing(tm).
If we drop the timeout to something lower, there is a class of users with ephemeral disks who will get errors in the log about not waiting long enough for the disk to show up. Keeping it as it is means new instance types without the disk have this long timeout, but that does not regress other instance types. Including the vmSize check in cloud-init seems like a reasonable compromise, and it can be updated in cloud-init.
Looking at the function address_ephemeral_resize; in the case where we don't wait long enough there are some paths that will break, for example this:
https://bugs.launchpad.net/ubuntu/+source/cloud-init/+bug/1611074
Not waiting would mean cloud-init would fail to resize/reformat the ephemeral disk
Would it be reasonable to pre-populate those instances with user-data (ideally vendor-data) via the UI or a CLI template that indicates no ephemeral disk is attached? This could also be useful on instances which do have an ephemeral disk but whose users don't want it configured during first boot (to save the boot time spent on partitioning and formatting).
user-data is processed before cloud-init/cmd/main.py calls the ds.activate() method which is what triggers the ephemeral resize. Then DatasourceAzure.activate() can skip the call to address_ephemeral_resize() altogether.
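A minimal sketch of that idea follows, under stated assumptions: the config key "azure_skip_ephemeral_setup" is hypothetical (no such option is confirmed anywhere in this thread), and activate() here is a standalone stand-in that takes the resize step as a callable rather than the real datasource method.

```python
# Hypothetical key; a real option name would have to be agreed upstream.
SKIP_KEY = "azure_skip_ephemeral_setup"


def wants_ephemeral_setup(cfg):
    """Return False when user-data/vendor-data asks to skip the ephemeral disk."""
    return not bool(cfg.get(SKIP_KEY, False))


def activate(cfg, resize_fn):
    """Stand-in for the datasource activate step.

    resize_fn represents the call that waits for and reformats the
    ephemeral disk; it is skipped entirely when the flag is set.
    """
    if not wants_ephemeral_setup(cfg):
        return False
    resize_fn()
    return True


calls = []
activate({}, lambda: calls.append("resize"))
activate({SKIP_KEY: True}, lambda: calls.append("resize"))
print(calls)
```

Because the flag is read before the resize step runs, an instance that sets it never enters the wait at all.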
@raharper / @anhvoms I definitely agree with you that IMDS/the platform should expose whether an ephemeral resource disk is attached. Unfortunately, this platform-side change can't be made and rolled out for more than 6 months. This means that for over half a year, until the platform changes land, Linux VMs deployed on Azure without the disk will boot 2 minutes slower.
There are reasons why decreasing the wait time for ephemeral disks won't regress current existing VM SKUs:
For VM SKUs with ephemeral disks, the Azure platform guarantees that the ephemeral disk is attached before a VM is booted.
In the past few years, we've never seen an instance where cloud-init had to wait for ephemeral disks to come up, as the disk was already attached (guaranteed by the platform) and the disk symlink was already created (created by udev rules).
As mentioned in my PR description above, I've performed deployment tests across a variety of Linux images on Azure. The intent was to test whether any distros or images had delays in creating the disk symlink.
grep "Azure ephemeral disk: All files appeared after" /var/log/cloud-init.log
util.py[DEBUG]: Azure ephemeral disk: All files appeared after 0 seconds: ['/dev/disk/cloud/azure_resource']
The statement above is true and tested for the following images:
RedHat:RHEL:8.2:latest
SUSE:sles-15-sp2:gen2:latest
Canonical:0001-com-ubuntu-server-focal:20_04-lts:latest
Canonical:0001-com-ubuntu-server-focal:20_04-lts-gen2:latest
Canonical:UbuntuServer:18.04-LTS:latest
Canonical:UbuntuServer:18_04-lts-gen2:latest
RedHat:RHEL:7-LVM:7.9.2020111202
RedHat:RHEL:7lvm-gen2:7.9.2020111205
RedHat:RHEL:7.8:7.8.2020111309
RedHat:RHEL:79-gen2:7.9.2020111302
RedHat:RHEL:7_9:7.9.2020111301
RedHat:RHEL:8-LVM:8.3.2020111909
RedHat:RHEL:8-lvm-gen2:8.3.2020111910
[...] systemd-udevd is loaded, which is very early in boot and way before cloud-init is loaded by systemd), there will be almost 0 chance for this change to regress existing VMs.
Ultimately, this change (1) prevents a 2-minute performance delay for all Linux VMs without ephemeral disks on Azure (which is a pretty significant perf penalty for a wide range of users) and (2) does not really regress something that hasn't happened before/has a very low chance of ever happening (very small/non-existent users affected).
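For anyone repeating the log check quoted in this comment across many VMs, the grep can be scripted. The sketch below is an illustration only: the helper name is hypothetical, and the pattern is taken from the single log line shown above, so other cloud-init versions may word the message differently.

```python
import re

# Pattern based on the one log line quoted in this thread.
APPEARED_RE = re.compile(
    r"Azure ephemeral disk: All files appeared after (\d+) seconds"
)


def ephemeral_wait_seconds(log_text):
    """Return every reported wait (in seconds) found in cloud-init log text."""
    return [int(m.group(1)) for m in APPEARED_RE.finditer(log_text)]


sample = (
    "util.py[DEBUG]: Azure ephemeral disk: All files appeared after "
    "0 seconds: ['/dev/disk/cloud/azure_resource']\n"
)
print(ephemeral_wait_seconds(sample))
```

A non-zero value on any image would be a counterexample to the claim that the symlink already exists by the time cloud-init looks for it.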
I apologize for not reading your preamble in this PR more closely. If this is what the platform is guaranteeing, then it seems reasonable to remove the timeout completely. If the disk is not there, waiting 5 seconds isn't going to make it show up (rather, something more invasive, like running a blkid command to probe the storage layer, would likely be needed).
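To make the difference between a timed wait and no wait concrete, here is a small sketch of a polling check. It is not cloud-init's implementation; the function name and parameters are hypothetical. With maxwait=0 it reduces to a single existence test, which is what removing the timeout amounts to.

```python
import os
import time


def wait_for_path(path, maxwait=0, naplen=0.5):
    """Poll for path until it exists or maxwait seconds have elapsed.

    With maxwait=0 the path is checked exactly once and never slept on.
    """
    deadline = time.monotonic() + maxwait
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(naplen)


# Symlink name taken from the log line quoted earlier in this thread.
print(wait_for_path("/dev/disk/cloud/azure_resource", maxwait=0))
```

If the platform attaches the disk before boot and udev creates the symlink early, the one-shot check returns True immediately on VMs that have the disk and False immediately on those that do not.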
No worries and thanks @raharper! This info definitely needs to be exposed through IMDS (as a 5-second wait is still a significant penalty especially for mass VM scale outs). After platform support is there (in a few months), we can programmatically decide whether to even wait for a disk or not.