Skip to content

Remove raise_on_failure from wait calls#93

Merged
TheRealFalcon merged 1 commit into
masterfrom
remove-raise-on-failure
Jan 22, 2021
Merged

Remove raise_on_failure from wait calls#93
TheRealFalcon merged 1 commit into
masterfrom
remove-raise-on-failure

Conversation

@TheRealFalcon
Copy link
Copy Markdown
Contributor

The 'raise_on_failure' aspect of waiting for cloud-init to start is
causing a fair amount of additional code and workarounds. It was
introduced as an optional parameter as to not break the existing
API, but given the amount of heartburn it's causing, it seems best to
remove it entirely. If one needs the original behavior, checking cloud-init
status via an execute call should suffice.

@TheRealFalcon
Copy link
Copy Markdown
Contributor Author

@blackboxsw @lucasmoura @paride , this would be a breaking change, but I think it is necessary. Given the number of places wait is called (launch, start, restart, other wait methods), along with the different places we have something along the lines of if wait, it takes a lot of extra code in various places to achieve a wait without raising. (The latest example is #92 and the EC2 code we're accidentally skipping ).

For the places relying on a raise, it would require an .execute('cloud-init status') and checking the result. Does this work for everybody?

@paride
Copy link
Copy Markdown
Collaborator

paride commented Jan 15, 2021

@TheRealFalcon thanks for the heads-up, works for the bootspeed jobs.

lucasmoura
lucasmoura previously approved these changes Jan 15, 2021
Comment thread pycloudlib/lxd/instance.py Outdated
try:
super()._wait_for_cloudinit(
raise_on_failure=True)
super()._wait_for_cloudinit()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this code will work anymore, right ? Since _wait_for_cloudinit will not raise the OSError anymore, we will never properly wait for cloud-init status to be over.

Also, I am little bit worried about this change, because even though we can adapt the uaclient code to have that explicit behavior of checking the cloud-init status return, any user that will need to launch a VM will need to know beforehand that they should wait for the lxd-agent to be ready. This means they will need to replicate this for loop logic on their code, which I don't think it is optimal.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought the OSError was coming from LXD itself before we called cloud-init status, which means we would continue to get the error even after this change. Is that not true? I tried testing locally, but I'm actually not seeing any of these errors anymore for LXD VMs? Am I just lucky or did these issues maybe get resolved and it's all a moot point anyway 😄 ?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those are good questions. Let me reproduce that change and run some ua tests so I can test what happens in that scenario

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@TheRealFalcon I think I know what is happening on the lxd vms. I think we are not hitting the lxd-agent issue because we are actually treating that already on the _wait_for_execute method. There, it tries to run a whoami command and if we hit the lxd-agentissue, we will sleep and try the command again. Therefore, we will have enough time for the lxd-agent to be setup.

However, regarding the OSError issue, the _wait_for_cloudinit method is not raising it. For example, if we comment the _wait_for_execute function and try to launch a lxd vm., we will see that the _wait_for_cloudinit command will fail because of the lxd-agent and it will not raise an OSError, meaning that the lxd loop we have will not work

@lucasmoura lucasmoura dismissed their stale review January 15, 2021 13:23

wrong review

@TheRealFalcon
Copy link
Copy Markdown
Contributor Author

@lucasmoura Just to double check, since the _wait_for_execute() will catch errors, this is ok to merge? If I get the approval I'll merge manually. Travis is failing cloud-init tests because cloud-init will also need a corresponding change.

@lucasmoura
Copy link
Copy Markdown

@TheRealFalcon I think we just need a couple of change before merging it:

  • Since the _wait_for_execute command is handling the lxd-agent issue on lxd vms, I don't think we need the _wait_for_cloudinit override on lxd/instance anymore.

  • I think we should turn the _wait_for_cloudinit into a public method and return the result of the cloud-init status call. That way, users will be able to respond to it appropriately.

Also, @blackboxsw do you mind reviewing this too ? Just to see if I am not missing anything ? I have run a uaclient vm test with those changes and also without the _wait_for_cloudinit override in lxd/instance and everything seems to be working fine. But maybe I am still missing something

@TheRealFalcon
Copy link
Copy Markdown
Contributor Author

TheRealFalcon commented Jan 21, 2021

@lucasmoura Agree on point 1. On point 2,

I think we should turn the _wait_for_cloudinit into a public method and return the result of the cloud-init status call. That way, users will be able to respond to it appropriately.

I thought about that too, but also thought it might be a little unintuitive to receive a cloud-init status Result when calling start/restart/wait/etc (those all have _wait_for_cloudinit in their call hierarchy). It's only one extra line to make an additional execute call to get cloud-init status after _wait_for_cloudinit has returned. I could go either way though. @lucasmoura @blackboxsw @OddBloke , any of you have strong opinions?

@lucasmoura
Copy link
Copy Markdown

I don't see a problem with the execute("cloud-init status")call to assert that. But I think it will be more intuitive for an user to have _wait_for_cloudinit to return the status that cloud-init reached at end. Regarding the start/restart/wait methods, I don't think it will be an issue, since they would not do anything with the result returned by _wait_for_cloudinit.

But I think @OddBloke and @blackboxsw opinions on the topic will be really useful on deciding that.

@TheRealFalcon
Copy link
Copy Markdown
Contributor Author

@lucasmoura I responded to the "return Result" aspect of your proposal, but didn't mention anything about the "make it public" part, but it looks like the outcome of the latter feeds into the former. If these other methods (start/restart/etc) are already calling _wait_for_cloudinit, I'm not seeing why we would want to make it public or why people would want to call it manually. When do you think it would make sense to call a public wait_for_cloudinit?

@lucasmoura
Copy link
Copy Markdown

@TheRealFalcon I thought about cases where users want to guarantee that cloud-init status was not a failure. But I do see your point that this feels a little bit redundant since restart/start/etc already calls that method.

Looking back to it, I think having a supported way to see the status of cloud-init will not be that much useful. So I am okay with the execute approach.

@TheRealFalcon TheRealFalcon force-pushed the remove-raise-on-failure branch from 66fa68c to 04d3fa4 Compare January 22, 2021 17:36
The 'raise_on_failure' aspect of waiting for cloud-init to start is
causing a fair amount of additional code and workarounds. It was
introduced as an optional parameter as to not break the existing
API, but given the amount of heartburn it's causing, it seems best to
remove it entirely. If one needs the original behavior, checking cloud-init
status via an execute call should suffice.
@TheRealFalcon TheRealFalcon force-pushed the remove-raise-on-failure branch from 04d3fa4 to 87ee40b Compare January 22, 2021 17:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants