azure: case-insensitive UUID to avoid new IID during kernel upgrade#798
Conversation
TheRealFalcon
left a comment
There was a problem hiding this comment.
Only looked at the integration test. Can do fuller review tomorrow.
OddBloke
left a comment
There was a problem hiding this comment.
Some quick integration test thoughts.
ba8e071 to
bf190cd
Compare
bf190cd to
5c199e9
Compare
OddBloke
left a comment
There was a problem hiding this comment.
Just integration test review for now. I've left some specific comments, but I'd like us to step back and consider whether this is the right path for this test. Previously you were launching an image that would present with one UUID case, and then upgrading its kernel in a way that would present the other case. I suggested an iteration of that approach: using an old standard Ubuntu image to avoid the dependency on Pro images. I think it's likely that would give us a simpler test (with fewer external dependencies); did you try that approach before settling on this one?
|
|
||
| def _check_iid_insensitive_across_kernel_upgrade(client): | ||
| uuid = client.read_from_file('/sys/class/dmi/id/product_uuid') | ||
| assert uuid.islower(), "UUID does not appear to be lowercase {}".format( |
There was a problem hiding this comment.
It would be good to indicate (either in the message or in a comment) that this is a precondition assertion, rather than an assertion on what cloud-init should have done. Let's do this for other such assertions too.
| uuid = client.read_from_file('/sys/class/dmi/id/product_uuid') | ||
| assert uuid.isupper(), "UUID does not appear to be uppercase {}".format( | ||
| uuid | ||
| ) |
There was a problem hiding this comment.
Should we check that the UUID hasn't changed (and therefore invalidated our preconditions)?
| case-insensitive comparison and avoid triggering "new instance" detection | ||
| in cloud-init on Azure platform. | ||
|
|
||
| The test will launch an Ubuntu image which has a 5.4 kernel. We allow |
There was a problem hiding this comment.
This isn't true: this test is not constrained by Ubuntu release, so will launch whatever version of the kernel is in the latest Azure image for the target Ubuntu release.
There was a problem hiding this comment.
We've now pinned the test to launch a specific Ubuntu PRO FIPS image that will be available forever and exhibits a UUID case change when upgrading to linux-azure from linux-azure-fips.
OddBloke
left a comment
There was a problem hiding this comment.
OK, and now some not-integration-test review.
5566f9b to
a38eb71
Compare
OddBloke
left a comment
There was a problem hiding this comment.
Code, unit tests and general structure of the integration test looks good to me, thank you!
We still need to get this test to run only on bionic, see my inline thoughts.
|
|
||
| @pytest.mark.azure | ||
| @pytest.mark.sru_next | ||
| @pytest.mark.not_xenial |
There was a problem hiding this comment.
This will still run on every release bionic+; we could introduce an only_bionic mark or similar. However, given that this test doesn't use the client fixtures, we don't actually need to perform the skip via marks to avoid the instance launch: we can do it in the body of the test before we call launch instead.
Ideally, we'd be able to do something like if (session_cloud.image.os, session_cloud.image.release) != ("ubuntu", "bionic"): pytest.skip(...) but we don't currently store the ImageSpecification that we determine on session_cloud. That's probably not too painful to hook up (though may require some thought around snapshotting), but I don't think we need to for this, re-parsing OS_IMAGE should be sufficient: something like image_spec = ImageSpecification.from_os_image(); if (image_spec.os, image_spec.release) != ("ubuntu", "bionic"): pytest.skip(...).
There was a problem hiding this comment.
@OddBloke funny, I put that type of pytest.skip in locally and decided to pullt it out based on looking at self.settings.os_image I think. I'll put some semblance of that back into this PR
a38eb71 to
65b2311
Compare
OddBloke
left a comment
There was a problem hiding this comment.
Thanks for the revisions, a few more thoughts inline.
One thing to note, if I try to run this test locally, I get:
E msrestazure.azure_exceptions.CloudError: Azure Error: ResourcePurchaseValidationFailed
E Message: User failed validation to purchase resources. Error message: 'You have not accepted the legal terms on this subscription: '9a37cc2c-dc4a-4097-84dd-45dd2b8cbc63' for this plan. Before the subscription can be used, you need to accept the legal terms of the image. To read and accept legal terms, use the Azure CLI commands described at https://go.microsoft.com/fwlink/?linkid=2110637 or the PowerShell commands available at https://go.microsoft.com/fwlink/?linkid=862451. Alternatively, deploying via the Azure portal provides a UI experience for reading and accepting the legal terms. Offer details: publisher='canonical' offer = '0001-com-ubuntu-pro-bionic-fips', sku = 'pro-fips-18_04', Correlation Id: 'b0f49067-fae7-4c9f-b26c-8b7fd2eec9a0'.'
I'm going to follow those steps now, but this confirms that we shouldn't be running this test by default.
| # work around pad.lv/1908287 | ||
| instance.restart() | ||
| if not instance.execute("cloud-init status --wait --long").ok: |
There was a problem hiding this comment.
I'm not 100% sure this will workaround the issue: this first execute call could still get into the instance before it reboots (which is the root cause of the issue) and therefore not trigger any of the waiting behaviour.
There was a problem hiding this comment.
(Please see my below comment on where we call this before spending too much time on these comments. 👍)
| for _ in range(10): | ||
| time.sleep(5) | ||
| result = instance.execute("cloud-init status --wait --long") | ||
| if result.ok: |
There was a problem hiding this comment.
If result.ok is False, what is gained by retrying? --wait means that we know that's the final status we're going to hit, so I think these retries will just run cloud-init status 4 additional times before raising the below exception.
| # Allow for upgrade of cloud-init in FIPS image for pre-release testing | ||
| if install_cloudinit: | ||
| instance.install_new_cloud_init(source, take_snapshot=False) | ||
| _restart_with_retries(instance) |
There was a problem hiding this comment.
Given that we'll restart the instance for the new kernel, do we need to restart here as well?
There was a problem hiding this comment.
The reason we restart after upgrading cloud-init is because install_new_cloud_init calls instance.clean() which removes all cloud-init logs, semaphores and syslog though pycloudlib. What we are trying to establish is that the upgraded cloud-init detects uppercase UUID on first boot, then on 2nd boot into new kernel we don't accidentally detect a change in instance-id due to the now lowercase UUID. I support instead of the reboot I could either avoid instance.clean() in install_new_cloud_init by providing a clean=False param. That'd save time on the reboot (because cloud-init will have already detected the uppercase instance-id even before we upgrade cloud-init on that instance.
There was a problem hiding this comment.
I've pushed a change to avoid the extra-reboot and I'm testing this now.
There was a problem hiding this comment.
Yeah, I'd support modifying install_new_cloud_init to allow for this case; a clean parameter sounds reasonable to me.
@TheRealFalcon How does that sound to you?
There was a problem hiding this comment.
added clean param and confirmed success run of the following (which a built deb from this PR)
CLOUD_INIT_OS_IMAGE=bionic CLOUD_INIT_CLOUD_INIT_SOURCE=/cloud-init_20.4.1-86-g5c199e90-1~bddeb_all.deb CLOUD_INIT_PUBLIC_SSH_KEY=/mykey CLOUD_INIT_PLATFORM=azure .tox/integration-tests/bin/py.test -v tests/integration_tests/bugs/test_lp1835584.py65b2311 to
6396c3c
Compare
To follow-up However, now that I've successfully run the test, it has failed. The cloud-init version in that image (intentionally) doesn't have the fix to make this test pass, which means this test can never pass without an updated cloud-init being installed. (This is different to our usual tests, which we would expect to fail until the Ubuntu image is updated with the fixed cloud-init: this test will never use an image with a fixed cloud-init.) Should we skip the test if we don't have a CLOUD_INIT_SOURCE which |
Kernel's newer than 4.15 present /sys/dmi/id/product_uuid as a lowercase value. Peviously UUID as uppercase. Azure datasource reads the product_uuid directly as their platform's instance-id. This present a problem if a kernel is either upgraded or downgraded across the 4.15 kernel version boundary because the case of the UUID will change, resulting in cloud-init seeing a "new" instance id and re-running all modules. This causes ssh host keys to get regenerated across reboot into the new kernels which will cause concern on long-running instances that something nefarious has happened. LP: #1835584
37c7fbe to
6c738da
Compare
Yes this a good point, I was thinking for current dev it gives us the change to run the test to see the failure currently without the image_source set. But, we probably don't want to instrument tests that we can easily cause to break. Reviewers could just remove such a pytest.skip if they wanted to see failures. |
6c738da to
2b83481
Compare
TheRealFalcon
left a comment
There was a problem hiding this comment.
Integration test LGTM. I did a much shallower review of everything else given how much Dan has been involved.
Proposed Commit Message
Kernel's newer than 4.15 present /sys/dmi/id/product_uuid as a
lowercase value. Previously UUID was uppercase.
Azure datasource reads the product_uuid directly as their platform's
instance-id. This presents a problem if a kernel is either
upgraded or downgraded across the 4.15 kernel version boundary because
the case of the UUID will change, resulting in cloud-init seeing a
"new" instance id and re-running all modules.
Re-running cc_ssh in cloud-init deletes and regenerates ssh_host keys
on a system which can cause concern on long-running instances that
somethingnefarious has happened.
Also add:
a FIPS kernel with uppercase UUID to a lowercase UUID in linux-azure
next SRU
LP: #1835584
Additional Context
Integration test will add a 4.14 generic kernel, add grub config to prefer the "generic" flavor of that kernel across reboot.
This triggers the uppercase product_uuid which on current cloud-init would generate new ssh host keys.
New cloud-init will not trigger this ssh host key creation.
Test Steps
Checklist: