Remove raise_on_failure from wait calls by TheRealFalcon · Pull Request #93 · canonical/pycloudlib

TheRealFalcon · 2021-01-14T20:11:51Z

The 'raise_on_failure' aspect of waiting for cloud-init to start is
causing a fair amount of additional code and workarounds. It was
introduced as an optional parameter as to not break the existing
API, but given the amount of heartburn it's causing, it seems best to
remove it entirely. If one needs the original behavior, checking cloud-init
status via an execute call should suffice.

TheRealFalcon · 2021-01-14T20:20:34Z

@blackboxsw @lucasmoura @paride , this would be a breaking change, but I think it is necessary. Given the number of places wait is called (launch, start, restart, other wait methods), along with the different places we have something along the lines of if wait, it takes a lot of extra code in various places to achieve a wait without raising. (The latest example is #92 and the EC2 code we're accidentally skipping ).

For the places relying on a raise, it would require an .execute('cloud-init status') and checking the result. Does this work for everybody?

paride · 2021-01-15T09:04:47Z

@TheRealFalcon thanks for the heads-up, works for the bootspeed jobs.

lucasmoura · 2021-01-15T13:21:31Z

                try:
-                    super()._wait_for_cloudinit(
-                        raise_on_failure=True)
+                    super()._wait_for_cloudinit()


I don't think this code will work anymore, right ? Since _wait_for_cloudinit will not raise the OSError anymore, we will never properly wait for cloud-init status to be over.

Also, I am little bit worried about this change, because even though we can adapt the uaclient code to have that explicit behavior of checking the cloud-init status return, any user that will need to launch a VM will need to know beforehand that they should wait for the lxd-agent to be ready. This means they will need to replicate this for loop logic on their code, which I don't think it is optimal.

I thought the OSError was coming from LXD itself before we called cloud-init status, which means we would continue to get the error even after this change. Is that not true? I tried testing locally, but I'm actually not seeing any of these errors anymore for LXD VMs? Am I just lucky or did these issues maybe get resolved and it's all a moot point anyway 😄 ?

Those are good questions. Let me reproduce that change and run some ua tests so I can test what happens in that scenario

@TheRealFalcon I think I know what is happening on the lxd vms. I think we are not hitting the lxd-agent issue because we are actually treating that already on the _wait_for_execute method. There, it tries to run a whoami command and if we hit the lxd-agentissue, we will sleep and try the command again. Therefore, we will have enough time for the lxd-agent to be setup.

However, regarding the OSError issue, the _wait_for_cloudinit method is not raising it. For example, if we comment the _wait_for_execute function and try to launch a lxd vm., we will see that the _wait_for_cloudinit command will fail because of the lxd-agent and it will not raise an OSError, meaning that the lxd loop we have will not work

wrong review

TheRealFalcon · 2021-01-20T23:05:03Z

@lucasmoura Just to double check, since the _wait_for_execute() will catch errors, this is ok to merge? If I get the approval I'll merge manually. Travis is failing cloud-init tests because cloud-init will also need a corresponding change.

lucasmoura · 2021-01-21T14:40:09Z

@TheRealFalcon I think we just need a couple of change before merging it:

Since the _wait_for_execute command is handling the lxd-agent issue on lxd vms, I don't think we need the _wait_for_cloudinit override on lxd/instance anymore.
I think we should turn the _wait_for_cloudinit into a public method and return the result of the cloud-init status call. That way, users will be able to respond to it appropriately.

Also, @blackboxsw do you mind reviewing this too ? Just to see if I am not missing anything ? I have run a uaclient vm test with those changes and also without the _wait_for_cloudinit override in lxd/instance and everything seems to be working fine. But maybe I am still missing something

TheRealFalcon · 2021-01-21T14:59:00Z

@lucasmoura Agree on point 1. On point 2,

I think we should turn the _wait_for_cloudinit into a public method and return the result of the cloud-init status call. That way, users will be able to respond to it appropriately.

I thought about that too, but also thought it might be a little unintuitive to receive a cloud-init status Result when calling start/restart/wait/etc (those all have _wait_for_cloudinit in their call hierarchy). It's only one extra line to make an additional execute call to get cloud-init status after _wait_for_cloudinit has returned. I could go either way though. @lucasmoura @blackboxsw @OddBloke , any of you have strong opinions?

lucasmoura · 2021-01-21T15:24:18Z

I don't see a problem with the execute("cloud-init status")call to assert that. But I think it will be more intuitive for an user to have _wait_for_cloudinit to return the status that cloud-init reached at end. Regarding the start/restart/wait methods, I don't think it will be an issue, since they would not do anything with the result returned by _wait_for_cloudinit.

But I think @OddBloke and @blackboxsw opinions on the topic will be really useful on deciding that.

TheRealFalcon · 2021-01-21T15:43:21Z

@lucasmoura I responded to the "return Result" aspect of your proposal, but didn't mention anything about the "make it public" part, but it looks like the outcome of the latter feeds into the former. If these other methods (start/restart/etc) are already calling _wait_for_cloudinit, I'm not seeing why we would want to make it public or why people would want to call it manually. When do you think it would make sense to call a public wait_for_cloudinit?

lucasmoura · 2021-01-21T16:12:04Z

@TheRealFalcon I thought about cases where users want to guarantee that cloud-init status was not a failure. But I do see your point that this feels a little bit redundant since restart/start/etc already calls that method.

Looking back to it, I think having a supported way to see the status of cloud-init will not be that much useful. So I am okay with the execute approach.

The 'raise_on_failure' aspect of waiting for cloud-init to start is causing a fair amount of additional code and workarounds. It was introduced as an optional parameter as to not break the existing API, but given the amount of heartburn it's causing, it seems best to remove it entirely. If one needs the original behavior, checking cloud-init status via an execute call should suffice.

lucasmoura previously approved these changes Jan 15, 2021

View reviewed changes

TheRealFalcon force-pushed the remove-raise-on-failure branch from 66fa68c to 04d3fa4 Compare January 22, 2021 17:36

TheRealFalcon force-pushed the remove-raise-on-failure branch from 04d3fa4 to 87ee40b Compare January 22, 2021 17:37

lucasmoura approved these changes Jan 22, 2021

View reviewed changes

TheRealFalcon merged commit 878981e into master Jan 22, 2021

TheRealFalcon deleted the remove-raise-on-failure branch January 22, 2021 18:25

TheRealFalcon mentioned this pull request Jan 25, 2021

Remove 'remove-raise-on-failure' calls from integration_tests canonical/cloud-init#788

Merged

3 tasks

TheRealFalcon mentioned this pull request Feb 9, 2021

WIP: Track boot id on each boot #92

Closed

lucasmoura mentioned this pull request Feb 16, 2021

Update lxc-setup script on lxd vm instances #117

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove raise_on_failure from wait calls#93

Remove raise_on_failure from wait calls#93
TheRealFalcon merged 1 commit into
masterfrom
remove-raise-on-failure

TheRealFalcon commented Jan 14, 2021

Uh oh!

TheRealFalcon commented Jan 14, 2021

Uh oh!

paride commented Jan 15, 2021

Uh oh!

lucasmoura Jan 15, 2021

Uh oh!

TheRealFalcon Jan 15, 2021

Uh oh!

lucasmoura Jan 18, 2021

Uh oh!

lucasmoura Jan 20, 2021

Uh oh!

TheRealFalcon commented Jan 20, 2021

Uh oh!

lucasmoura commented Jan 21, 2021

Uh oh!

TheRealFalcon commented Jan 21, 2021 •

edited

Loading

Uh oh!

lucasmoura commented Jan 21, 2021

Uh oh!

TheRealFalcon commented Jan 21, 2021

Uh oh!

lucasmoura commented Jan 21, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

TheRealFalcon commented Jan 14, 2021

Uh oh!

TheRealFalcon commented Jan 14, 2021

Uh oh!

paride commented Jan 15, 2021

Uh oh!

lucasmoura Jan 15, 2021

Choose a reason for hiding this comment

Uh oh!

TheRealFalcon Jan 15, 2021

Choose a reason for hiding this comment

Uh oh!

lucasmoura Jan 18, 2021

Choose a reason for hiding this comment

Uh oh!

lucasmoura Jan 20, 2021

Choose a reason for hiding this comment

Uh oh!

TheRealFalcon commented Jan 20, 2021

Uh oh!

lucasmoura commented Jan 21, 2021

Uh oh!

TheRealFalcon commented Jan 21, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

lucasmoura commented Jan 21, 2021

Uh oh!

TheRealFalcon commented Jan 21, 2021

Uh oh!

lucasmoura commented Jan 21, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

TheRealFalcon commented Jan 21, 2021 •

edited

Loading