Skip to content

WIP: Track boot id on each boot#92

Closed
TheRealFalcon wants to merge 1 commit into
masterfrom
wait-for-boot
Closed

WIP: Track boot id on each boot#92
TheRealFalcon wants to merge 1 commit into
masterfrom
wait-for-boot

Conversation

@TheRealFalcon
Copy link
Copy Markdown
Contributor

When we signal reboot and attempt to reconnect, we can wind up
reconnecting before the reboot ever took place. This change updates the boot
process to ensure we've obtained a new boot id before proceeding.

Currently WIP. Haven't checked tests or checked anything other than the simple ec2 happy path yet.

When we signal reboot and attempt to reconnect, we can wind up
reconnecting before the reboot ever took place. This change updates the boot
process to ensure we've obtained a new boot id before proceeding.
@TheRealFalcon TheRealFalcon changed the title Track boot id on each boot. WIP: Track boot id on each boot Jan 13, 2021
Copy link
Copy Markdown
Contributor

@OddBloke OddBloke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this POC, James, it's made having a concrete conversation much easier!

(From the now-Invalid cloud-init bug here, so we aren't discussing in two places):

One problem is that on a restart we're not going to get any status change from these cloud APIs (e.g., always RUNNING or something like it), so I'm not sure any of the "_wait_for_instance_start" methods buy us anything of value, at least on restart.

Agreed, these can at best be used as a guard to save us from attempting to connect before there's any chance of it working.

One open question I have in this area: should we be using the cloud's mechanisms for most of these restarts (which, at best, are ACPI events, but may amount to a hard power-cycle on some clouds), or should we be executing reboot in the running system?

Looking at GCE's implementation, for example, it amounts to an ACPI shutdown (via their API), then a regular boot. Speaking from personal experience, this is not entirely the same as a reboot: my desktop will reboot when I reboot it, but it doesn't actually ever turn off if I halt it: I have to hold the power button for 4s. I'm not worried about this exact behaviour in GCE, of course, but it does indicate that GceInstance.restart (ACPI halt + power-on) and EC2Instance.restart (ACPI reboot) may behave differently to one another in subtle ways.

Thoughts?

Comment thread pycloudlib/instance.py
result = self.execute(test_instance_command)

if result.ok:
boot_id_command = "cat /proc/sys/kernel/random/boot_id"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about this further, part of the reason whoami is used here is that it's present on pretty much any Unix system: failures to run it almost certainly mean that we don't have access to the system. This replacement will never succeed on non-Linux systems, such as *BSD, where we'll block for 5 minutes and then fail.

We also won't have self.boot_id set on instances which aren't wait'd for at all, and it will be stale on subsequent boots of instances which are rebooted by means other than pycloudlib: cc_power_state or even a simple .execute("sudo reboot"). This will make using it tricky: it won't necessarily be present when we restart, so we'd have to determine it at that point regardless. I wonder if this indicates that instead of storing it at the instance-level we should contain it to the restart path, something like: determine current boot ID, trigger a restart, then call self.wait in a loop which we break out of once the post-wait boot ID doesn't match?

(As an aside, we should consider if whatever we do here allows us to drop (or substantially simplify) the reboot detection code in test_power_state_change in cloud-init; if we can't readily use it there, then that may mean we have not found the right abstraction.)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought it might be possibly useful in other ways of checking boot too, but you're right in that it'd be a lot more work to keep the boot id current. So I agree, let's restrict it to restart.

Your parenthetical is right too. The boot detecting code was initially added there because of this, so a good test of this should be to remove that code.

@TheRealFalcon
Copy link
Copy Markdown
Contributor Author

The ec2 case, which the original bug was based off of, was fixed as a result of #93. There was already code in the ec2 restart path to track boot ids, but we weren't hitting it in cloud-init because we were always setting wait=False, and then doing our own wait after the call.

Since we've determined we don't want a global boot id, this PR is no longer relevant.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants