Skip to content

azure: case-insensitive UUID to avoid new IID during kernel upgrade#798

Merged
blackboxsw merged 3 commits into
canonical:masterfrom
blackboxsw:ssh-revert-default-auth-keys
Feb 19, 2021
Merged

azure: case-insensitive UUID to avoid new IID during kernel upgrade#798
blackboxsw merged 3 commits into
canonical:masterfrom
blackboxsw:ssh-revert-default-auth-keys

Conversation

@blackboxsw
Copy link
Copy Markdown
Collaborator

@blackboxsw blackboxsw commented Feb 1, 2021

Proposed Commit Message

Kernel's newer than 4.15 present /sys/dmi/id/product_uuid as a
lowercase value. Previously UUID was uppercase.

Azure datasource reads the product_uuid directly as their platform's
instance-id. This presents a problem if a kernel is either
upgraded or downgraded across the 4.15 kernel version boundary because
the case of the UUID will change, resulting in cloud-init seeing a
"new" instance id and re-running all modules.

Re-running cc_ssh in cloud-init deletes and regenerates ssh_host keys
on a system which can cause concern on long-running instances that
somethingnefarious has happened.

Also add:

  • An integration test for this for Azure Bionic Ubuntu FIPS upgrading from
    a FIPS kernel with uppercase UUID to a lowercase UUID in linux-azure
  • A new pytest.mark.sru_next to collect all integration tests related to our
    next SRU

LP: #1835584

Additional Context

Integration test will add a 4.14 generic kernel, add grub config to prefer the "generic" flavor of that kernel across reboot.
This triggers the uppercase product_uuid which on current cloud-init would generate new ssh host keys.
New cloud-init will not trigger this ssh host key creation.

Test Steps

# create a passwordless ssh pub/private keypair
ssh-keygen 


 CLOUD_INIT_PUBLIC_SSH_KEY=/root/.ssh/id_rsa_az.pub CLOUD_INIT_PLATFORM=azure  .tox/integration-tests/bin/py.test -v tests/integration_tests/bugs/test_lp1835584.py
# Expect failure running against current cloud-init: 
# E       AssertionError: config_ssh ran too many times 2
# E       assert 1 == 2
# E         +1
# E         -2

make deb
# Expect success running using CLOUD_INIT_CLOUD_INIT_SOURCE=<deb_from_this_pr>

Checklist:

  • My code follows the process laid out in the documentation
  • I have updated or added any unit tests accordingly
  • [n/a] I have updated or added any documentation accordingly

Copy link
Copy Markdown
Contributor

@TheRealFalcon TheRealFalcon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only looked at the integration test. Can do fuller review tomorrow.

Comment thread tests/integration_tests/bugs/test_lp1835584.py Outdated
Comment thread tests/integration_tests/bugs/test_lp1835584.py Outdated
Copy link
Copy Markdown
Collaborator

@OddBloke OddBloke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some quick integration test thoughts.

Comment thread tests/integration_tests/bugs/test_lp1835584.py Outdated
Comment thread tests/integration_tests/bugs/test_lp1835584.py Outdated
Comment thread tests/integration_tests/bugs/test_lp1835584.py Outdated
Comment thread tests/integration_tests/bugs/test_lp1835584.py Outdated
Comment thread tests/integration_tests/bugs/test_lp1835584.py Outdated
@blackboxsw blackboxsw force-pushed the ssh-revert-default-auth-keys branch 3 times, most recently from ba8e071 to bf190cd Compare February 12, 2021 21:59
@blackboxsw blackboxsw force-pushed the ssh-revert-default-auth-keys branch from bf190cd to 5c199e9 Compare February 12, 2021 22:34
Copy link
Copy Markdown
Collaborator

@OddBloke OddBloke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just integration test review for now. I've left some specific comments, but I'd like us to step back and consider whether this is the right path for this test. Previously you were launching an image that would present with one UUID case, and then upgrading its kernel in a way that would present the other case. I suggested an iteration of that approach: using an old standard Ubuntu image to avoid the dependency on Pro images. I think it's likely that would give us a simpler test (with fewer external dependencies); did you try that approach before settling on this one?


def _check_iid_insensitive_across_kernel_upgrade(client):
uuid = client.read_from_file('/sys/class/dmi/id/product_uuid')
assert uuid.islower(), "UUID does not appear to be lowercase {}".format(
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be good to indicate (either in the message or in a comment) that this is a precondition assertion, rather than an assertion on what cloud-init should have done. Let's do this for other such assertions too.

Comment thread tests/integration_tests/bugs/test_lp1835584.py Outdated
Comment thread tests/integration_tests/bugs/test_lp1835584.py Outdated
Comment on lines +72 to +66
uuid = client.read_from_file('/sys/class/dmi/id/product_uuid')
assert uuid.isupper(), "UUID does not appear to be uppercase {}".format(
uuid
)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we check that the UUID hasn't changed (and therefore invalidated our preconditions)?

Comment thread tests/integration_tests/bugs/test_lp1835584.py Outdated
case-insensitive comparison and avoid triggering "new instance" detection
in cloud-init on Azure platform.

The test will launch an Ubuntu image which has a 5.4 kernel. We allow
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't true: this test is not constrained by Ubuntu release, so will launch whatever version of the kernel is in the latest Azure image for the target Ubuntu release.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've now pinned the test to launch a specific Ubuntu PRO FIPS image that will be available forever and exhibits a UUID case change when upgrading to linux-azure from linux-azure-fips.

Comment thread tests/integration_tests/bugs/test_lp1835584.py Outdated
Comment thread tests/integration_tests/bugs/test_lp1835584.py Outdated
Comment thread tests/integration_tests/bugs/test_lp1835584.py Outdated
Copy link
Copy Markdown
Collaborator

@OddBloke OddBloke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, and now some not-integration-test review.

Comment thread cloudinit/sources/DataSourceAzure.py
Comment thread tests/unittests/test_datasource/test_azure.py Outdated
@blackboxsw blackboxsw force-pushed the ssh-revert-default-auth-keys branch 2 times, most recently from 5566f9b to a38eb71 Compare February 17, 2021 20:19
@blackboxsw blackboxsw requested a review from OddBloke February 17, 2021 20:28
Copy link
Copy Markdown
Collaborator

@OddBloke OddBloke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code, unit tests and general structure of the integration test looks good to me, thank you!

We still need to get this test to run only on bionic, see my inline thoughts.

Comment thread tests/integration_tests/bugs/test_lp1835584.py Outdated

@pytest.mark.azure
@pytest.mark.sru_next
@pytest.mark.not_xenial
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will still run on every release bionic+; we could introduce an only_bionic mark or similar. However, given that this test doesn't use the client fixtures, we don't actually need to perform the skip via marks to avoid the instance launch: we can do it in the body of the test before we call launch instead.

Ideally, we'd be able to do something like if (session_cloud.image.os, session_cloud.image.release) != ("ubuntu", "bionic"): pytest.skip(...) but we don't currently store the ImageSpecification that we determine on session_cloud. That's probably not too painful to hook up (though may require some thought around snapshotting), but I don't think we need to for this, re-parsing OS_IMAGE should be sufficient: something like image_spec = ImageSpecification.from_os_image(); if (image_spec.os, image_spec.release) != ("ubuntu", "bionic"): pytest.skip(...).

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@OddBloke funny, I put that type of pytest.skip in locally and decided to pullt it out based on looking at self.settings.os_image I think. I'll put some semblance of that back into this PR

Comment thread tests/integration_tests/bugs/test_lp1835584.py Outdated
@blackboxsw blackboxsw force-pushed the ssh-revert-default-auth-keys branch from a38eb71 to 65b2311 Compare February 18, 2021 02:13
Copy link
Copy Markdown
Collaborator

@OddBloke OddBloke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the revisions, a few more thoughts inline.

One thing to note, if I try to run this test locally, I get:

E           msrestazure.azure_exceptions.CloudError: Azure Error: ResourcePurchaseValidationFailed
E           Message: User failed validation to purchase resources. Error message: 'You have not accepted the legal terms on this subscription: '9a37cc2c-dc4a-4097-84dd-45dd2b8cbc63' for this plan. Before the subscription can be used, you need to accept the legal terms of the image. To read and accept legal terms, use the Azure CLI commands described at https://go.microsoft.com/fwlink/?linkid=2110637 or the PowerShell commands available at https://go.microsoft.com/fwlink/?linkid=862451. Alternatively, deploying via the Azure portal provides a UI experience for reading and accepting the legal terms. Offer details: publisher='canonical' offer = '0001-com-ubuntu-pro-bionic-fips', sku = 'pro-fips-18_04', Correlation Id: 'b0f49067-fae7-4c9f-b26c-8b7fd2eec9a0'.'

I'm going to follow those steps now, but this confirms that we shouldn't be running this test by default.

Comment on lines +48 to +50
# work around pad.lv/1908287
instance.restart()
if not instance.execute("cloud-init status --wait --long").ok:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not 100% sure this will workaround the issue: this first execute call could still get into the instance before it reboots (which is the root cause of the issue) and therefore not trigger any of the waiting behaviour.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Please see my below comment on where we call this before spending too much time on these comments. 👍)

for _ in range(10):
time.sleep(5)
result = instance.execute("cloud-init status --wait --long")
if result.ok:
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If result.ok is False, what is gained by retrying? --wait means that we know that's the final status we're going to hit, so I think these retries will just run cloud-init status 4 additional times before raising the below exception.

# Allow for upgrade of cloud-init in FIPS image for pre-release testing
if install_cloudinit:
instance.install_new_cloud_init(source, take_snapshot=False)
_restart_with_retries(instance)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that we'll restart the instance for the new kernel, do we need to restart here as well?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason we restart after upgrading cloud-init is because install_new_cloud_init calls instance.clean() which removes all cloud-init logs, semaphores and syslog though pycloudlib. What we are trying to establish is that the upgraded cloud-init detects uppercase UUID on first boot, then on 2nd boot into new kernel we don't accidentally detect a change in instance-id due to the now lowercase UUID. I support instead of the reboot I could either avoid instance.clean() in install_new_cloud_init by providing a clean=False param. That'd save time on the reboot (because cloud-init will have already detected the uppercase instance-id even before we upgrade cloud-init on that instance.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've pushed a change to avoid the extra-reboot and I'm testing this now.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'd support modifying install_new_cloud_init to allow for this case; a clean parameter sounds reasonable to me.

@TheRealFalcon How does that sound to you?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added clean param and confirmed success run of the following (which a built deb from this PR)

 CLOUD_INIT_OS_IMAGE=bionic CLOUD_INIT_CLOUD_INIT_SOURCE=/cloud-init_20.4.1-86-g5c199e90-1~bddeb_all.deb CLOUD_INIT_PUBLIC_SSH_KEY=/mykey  CLOUD_INIT_PLATFORM=azure  .tox/integration-tests/bin/py.test -v tests/integration_tests/bugs/test_lp1835584.py

Comment thread tests/integration_tests/bugs/test_lp1835584.py Outdated
Comment thread tests/integration_tests/bugs/test_lp1835584.py Outdated
@blackboxsw blackboxsw force-pushed the ssh-revert-default-auth-keys branch from 65b2311 to 6396c3c Compare February 18, 2021 20:14
@OddBloke
Copy link
Copy Markdown
Collaborator

One thing to note, if I try to run this test locally, I get:

E           msrestazure.azure_exceptions.CloudError: Azure Error: ResourcePurchaseValidationFailed
E           Message: User failed validation to purchase resources. Error message: 'You have not accepted the legal terms on this subscription: '9a37cc2c-dc4a-4097-84dd-45dd2b8cbc63' for this plan. Before the subscription can be used, you need to accept the legal terms of the image. To read and accept legal terms, use the Azure CLI commands described at https://go.microsoft.com/fwlink/?linkid=2110637 or the PowerShell commands available at https://go.microsoft.com/fwlink/?linkid=862451. Alternatively, deploying via the Azure portal provides a UI experience for reading and accepting the legal terms. Offer details: publisher='canonical' offer = '0001-com-ubuntu-pro-bionic-fips', sku = 'pro-fips-18_04', Correlation Id: 'b0f49067-fae7-4c9f-b26c-8b7fd2eec9a0'.'

I'm going to follow those steps now, but this confirms that we shouldn't be running this test by default.

To follow-up az vm image terms accept --urn Canonical:0001-com-ubuntu-pro-bionic-fips:pro-fips-18_04:18.04.202010201 fixed this issue for me.

However, now that I've successfully run the test, it has failed. The cloud-init version in that image (intentionally) doesn't have the fix to make this test pass, which means this test can never pass without an updated cloud-init being installed. (This is different to our usual tests, which we would expect to fail until the Ubuntu image is updated with the fixed cloud-init: this test will never use an image with a fixed cloud-init.) Should we skip the test if we don't have a CLOUD_INIT_SOURCE which installs_new_version()?

Kernel's newer than 4.15 present /sys/dmi/id/product_uuid as a
lowercase value. Peviously UUID as uppercase.

Azure datasource reads the product_uuid directly as their platform's
instance-id. This present a problem if a kernel is either
upgraded or downgraded across the 4.15 kernel version boundary because
the case of the UUID will change, resulting in cloud-init seeing a
"new" instance id and re-running all modules.

This causes ssh host keys to get regenerated across reboot into the new
kernels which will cause concern on long-running instances that something
nefarious has happened.

LP: #1835584
@blackboxsw blackboxsw force-pushed the ssh-revert-default-auth-keys branch from 37c7fbe to 6c738da Compare February 18, 2021 20:50
@blackboxsw
Copy link
Copy Markdown
Collaborator Author

One thing to note, if I try to run this test locally, I get:

E           msrestazure.azure_exceptions.CloudError: Azure Error: ResourcePurchaseValidationFailed
E           Message: User failed validation to purchase resources. Error message: 'You have not accepted the legal terms on this subscription: '9a37cc2c-dc4a-4097-84dd-45dd2b8cbc63' for this plan. Before the subscription can be used, you need to accept the legal terms of the image. To read and accept legal terms, use the Azure CLI commands described at https://go.microsoft.com/fwlink/?linkid=2110637 or the PowerShell commands available at https://go.microsoft.com/fwlink/?linkid=862451. Alternatively, deploying via the Azure portal provides a UI experience for reading and accepting the legal terms. Offer details: publisher='canonical' offer = '0001-com-ubuntu-pro-bionic-fips', sku = 'pro-fips-18_04', Correlation Id: 'b0f49067-fae7-4c9f-b26c-8b7fd2eec9a0'.'

I'm going to follow those steps now, but this confirms that we shouldn't be running this test by default.

To follow-up az vm image terms accept --urn Canonical:0001-com-ubuntu-pro-bionic-fips:pro-fips-18_04:18.04.202010201 fixed this issue for me.

However, now that I've successfully run the test, it has failed. The cloud-init version in that image (intentionally) doesn't have the fix to make this test pass, which means this test can never pass without an updated cloud-init being installed. (This is different to our usual tests, which we would expect to fail until the Ubuntu image is updated with the fixed cloud-init: this test will never use an image with a fixed cloud-init.) Should we skip the test if we don't have a CLOUD_INIT_SOURCE which installs_new_version()?

Yes this a good point, I was thinking for current dev it gives us the change to run the test to see the failure currently without the image_source set. But, we probably don't want to instrument tests that we can easily cause to break. Reviewers could just remove such a pytest.skip if they wanted to see failures.

@blackboxsw blackboxsw force-pushed the ssh-revert-default-auth-keys branch from 6c738da to 2b83481 Compare February 18, 2021 21:01
Copy link
Copy Markdown
Contributor

@TheRealFalcon TheRealFalcon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Integration test LGTM. I did a much shallower review of everything else given how much Dan has been involved.

@blackboxsw blackboxsw merged commit 66e2d42 into canonical:master Feb 19, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants