WIP: Track boot id on each boot by TheRealFalcon · Pull Request #92 · canonical/pycloudlib

TheRealFalcon · 2021-01-13T16:36:09Z

When we signal reboot and attempt to reconnect, we can wind up
reconnecting before the reboot ever took place. This change updates the boot
process to ensure we've obtained a new boot id before proceeding.

Currently WIP. Haven't checked tests or checked anything other than the simple ec2 happy path yet.

When we signal reboot and attempt to reconnect, we can wind up reconnecting before the reboot ever took place. This change updates the boot process to ensure we've obtained a new boot id before proceeding.

OddBloke

Thanks for this POC, James, it's made having a concrete conversation much easier!

(From the now-Invalid cloud-init bug here, so we aren't discussing in two places):

One problem is that on a restart we're not going to get any status change from these cloud APIs (e.g., always RUNNING or something like it), so I'm not sure any of the "_wait_for_instance_start" methods buy us anything of value, at least on restart.

Agreed, these can at best be used as a guard to save us from attempting to connect before there's any chance of it working.

One open question I have in this area: should we be using the cloud's mechanisms for most of these restarts (which, at best, are ACPI events, but may amount to a hard power-cycle on some clouds), or should we be executing reboot in the running system?

Looking at GCE's implementation, for example, it amounts to an ACPI shutdown (via their API), then a regular boot. Speaking from personal experience, this is not entirely the same as a reboot: my desktop will reboot when I reboot it, but it doesn't actually ever turn off if I halt it: I have to hold the power button for 4s. I'm not worried about this exact behaviour in GCE, of course, but it does indicate that GceInstance.restart (ACPI halt + power-on) and EC2Instance.restart (ACPI reboot) may behave differently to one another in subtle ways.

Thoughts?

OddBloke · 2021-01-13T18:21:50Z

-                result = self.execute(test_instance_command)
-
-                if result.ok:
+        boot_id_command = "cat /proc/sys/kernel/random/boot_id"


Thinking about this further, part of the reason whoami is used here is that it's present on pretty much any Unix system: failures to run it almost certainly mean that we don't have access to the system. This replacement will never succeed on non-Linux systems, such as *BSD, where we'll block for 5 minutes and then fail.

We also won't have self.boot_id set on instances which aren't wait'd for at all, and it will be stale on subsequent boots of instances which are rebooted by means other than pycloudlib: cc_power_state or even a simple .execute("sudo reboot"). This will make using it tricky: it won't necessarily be present when we restart, so we'd have to determine it at that point regardless. I wonder if this indicates that instead of storing it at the instance-level we should contain it to the restart path, something like: determine current boot ID, trigger a restart, then call self.wait in a loop which we break out of once the post-wait boot ID doesn't match?

(As an aside, we should consider if whatever we do here allows us to drop (or substantially simplify) the reboot detection code in test_power_state_change in cloud-init; if we can't readily use it there, then that may mean we have not found the right abstraction.)

I thought it might be possibly useful in other ways of checking boot too, but you're right in that it'd be a lot more work to keep the boot id current. So I agree, let's restrict it to restart.

Your parenthetical is right too. The boot detecting code was initially added there because of this, so a good test of this should be to remove that code.

TheRealFalcon · 2021-02-09T16:45:13Z

The ec2 case, which the original bug was based off of, was fixed as a result of #93. There was already code in the ec2 restart path to track boot ids, but we weren't hitting it in cloud-init because we were always setting wait=False, and then doing our own wait after the call.

Since we've determined we don't want a global boot id, this PR is no longer relevant.

Track boot id on each boot.

26915dc

When we signal reboot and attempt to reconnect, we can wind up reconnecting before the reboot ever took place. This change updates the boot process to ensure we've obtained a new boot id before proceeding.

TheRealFalcon added the wip label Jan 13, 2021

TheRealFalcon changed the title ~~Track boot id on each boot.~~ WIP: Track boot id on each boot Jan 13, 2021

OddBloke reviewed Jan 13, 2021

View reviewed changes

TheRealFalcon mentioned this pull request Jan 14, 2021

Remove raise_on_failure from wait calls #93

Merged

TheRealFalcon closed this Feb 9, 2021

TheRealFalcon deleted the wait-for-boot branch February 9, 2021 18:02

ubuntu-server-builder mentioned this pull request May 12, 2023

Wrong access permissions of authorized keys file and parent directory when using absolute AuthorizedKeysFile canonical/cloud-init#3829

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

WIP: Track boot id on each boot#92

WIP: Track boot id on each boot#92
TheRealFalcon wants to merge 1 commit into
masterfrom
wait-for-boot

TheRealFalcon commented Jan 13, 2021

Uh oh!

OddBloke left a comment

Uh oh!

OddBloke Jan 13, 2021

Uh oh!

TheRealFalcon Jan 13, 2021

Uh oh!

TheRealFalcon commented Feb 9, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

TheRealFalcon commented Jan 13, 2021

Uh oh!

OddBloke left a comment

Choose a reason for hiding this comment

Uh oh!

OddBloke Jan 13, 2021

Choose a reason for hiding this comment

Uh oh!

TheRealFalcon Jan 13, 2021

Choose a reason for hiding this comment

Uh oh!

TheRealFalcon commented Feb 9, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants