WIP: Track boot id on each boot#92
Conversation
When we signal reboot and attempt to reconnect, we can wind up reconnecting before the reboot ever took place. This change updates the boot process to ensure we've obtained a new boot id before proceeding.
OddBloke
left a comment
There was a problem hiding this comment.
Thanks for this POC, James, it's made having a concrete conversation much easier!
(From the now-Invalid cloud-init bug here, so we aren't discussing in two places):
One problem is that on a restart we're not going to get any status change from these cloud APIs (e.g., always RUNNING or something like it), so I'm not sure any of the "_wait_for_instance_start" methods buy us anything of value, at least on restart.
Agreed, these can at best be used as a guard to save us from attempting to connect before there's any chance of it working.
One open question I have in this area: should we be using the cloud's mechanisms for most of these restarts (which, at best, are ACPI events, but may amount to a hard power-cycle on some clouds), or should we be executing reboot in the running system?
Looking at GCE's implementation, for example, it amounts to an ACPI shutdown (via their API), then a regular boot. Speaking from personal experience, this is not entirely the same as a reboot: my desktop will reboot when I reboot it, but it doesn't actually ever turn off if I halt it: I have to hold the power button for 4s. I'm not worried about this exact behaviour in GCE, of course, but it does indicate that GceInstance.restart (ACPI halt + power-on) and EC2Instance.restart (ACPI reboot) may behave differently to one another in subtle ways.
Thoughts?
| result = self.execute(test_instance_command) | ||
|
|
||
| if result.ok: | ||
| boot_id_command = "cat /proc/sys/kernel/random/boot_id" |
There was a problem hiding this comment.
Thinking about this further, part of the reason whoami is used here is that it's present on pretty much any Unix system: failures to run it almost certainly mean that we don't have access to the system. This replacement will never succeed on non-Linux systems, such as *BSD, where we'll block for 5 minutes and then fail.
We also won't have self.boot_id set on instances which aren't wait'd for at all, and it will be stale on subsequent boots of instances which are rebooted by means other than pycloudlib: cc_power_state or even a simple .execute("sudo reboot"). This will make using it tricky: it won't necessarily be present when we restart, so we'd have to determine it at that point regardless. I wonder if this indicates that instead of storing it at the instance-level we should contain it to the restart path, something like: determine current boot ID, trigger a restart, then call self.wait in a loop which we break out of once the post-wait boot ID doesn't match?
(As an aside, we should consider if whatever we do here allows us to drop (or substantially simplify) the reboot detection code in test_power_state_change in cloud-init; if we can't readily use it there, then that may mean we have not found the right abstraction.)
There was a problem hiding this comment.
I thought it might be possibly useful in other ways of checking boot too, but you're right in that it'd be a lot more work to keep the boot id current. So I agree, let's restrict it to restart.
Your parenthetical is right too. The boot detecting code was initially added there because of this, so a good test of this should be to remove that code.
|
The ec2 case, which the original bug was based off of, was fixed as a result of #93. There was already code in the ec2 Since we've determined we don't want a global boot id, this PR is no longer relevant. |
When we signal reboot and attempt to reconnect, we can wind up
reconnecting before the reboot ever took place. This change updates the boot
process to ensure we've obtained a new boot id before proceeding.
Currently WIP. Haven't checked tests or checked anything other than the simple ec2 happy path yet.