Fallback to cached datasource if no valid datasource by fred-lefebvre · Pull Request #229 · canonical/cloud-init

fred-lefebvre · 2020-02-29T00:29:21Z

commit message:

If no valid datasource was found after the search for
network datasources, we assume our datasource has gone
away and try to use the last cached datasource if
it hasn't been removed from the configuration.

An example where this is useful is when the EC2
IMDS endpoint gets disabled after the 1st boot.

some notes about this change:

This change extends the life of instance_link (/var/lib/cloud/instance) so that remains after we have failed to find a working network datasource
To mitigate Add a spec draft for the new distro namespace. #1, we switch the non-local init to set existing=check, so that it will only reuse a cache that would have been left behind without the change in Add a spec draft for the new distro namespace. #1.
To make sure we only use the fallback after find_source() has failed in non-local mode, we add a fallback boolean to fetch() that is only set to true if mode == sources.DSMODE_NETWORK
I considered moving most the code inside the except sources.DataSourceNotFoundException block to a function but it seemed easier to understand inside the context of _get_data_source(). I also considered modifying _restore_from_checked_cache() to handle this use case and just calling it again inside the except block. Thoughts?
My initial intent was to only set fallback = (mode == sources.DSMODE_NETWORK) if a new config option was set to minimise the blast radius of this change. But in the end, I felt the change made sense as a default. I'm happy to reconsider.

If no valid datasource was found after the search for network datasources, we assume our datasource has gone away and try to use the last cached datasource if it hasn't been removed from the configuration. An example where this is useful is when the EC2 IMDS endpoint gets disabled after the 1st boot.

raharper

Thanks for putting this PR up. I need to double check, but I do have some concerns about DSMODE_NETWORK only for this change. Ec2 (and others) run both in DSMODE_LOCAL and DSMODE_NETWORK. At local time, we will attempt to bring up networking to fetch IMDS data (which includes networking configuration). In your scenario where IMDS is not availble, it's not available in either LOCAL or NETWORK mode of Ec2 datasource; and cloud-init-local.service is going to take the sources.DSMODE_LOCAL path first, purge_cache() which is going to remove the instance_link; and then run an init.fetch();

In general, I like the idea of falling back on the existing datasource; there are other PRs in-flight looking to do something similiar (in VMWare OVF) so there's clearly a need for a general solution (though we can defer generalizing if it's easier to land DS specific changes).

I'd like to think a bit more on:

For datasources which do run at local time, can we determine that we're still on the previous datasource, but _get_data() has failed (ie, imds is down, or whatever config put isn't available , OVF doesn't have a custom config file.
If we can detect (1); would the situation change if we're DSMODE_NETWORK ?; we may need to defer till a later stage (cloud-init.service, vs, cloud-init-local.service) to decide that we cannot have a successful _get_data()
how do we communicate from local stage, to init stage that we've already determined that the DS is not available so that we don't repeat the checks we did?

fred-lefebvre · 2020-03-03T23:45:21Z

For datasources which do run at local time, can we determine that we're still on the previous datasource, but _get_data() has failed (ie, imds is down, or whatever config put isn't available , OVF doesn't have a custom config file.

If we can detect (1); would the situation change if we're DSMODE_NETWORK ?; we may need to defer till a later stage (cloud-init.service, vs, cloud-init-local.service) to decide that we cannot have a successful _get_data()

how do we communicate from local stage, to init stage that we've already determined that the DS is not available so that we don't repeat the checks we did?

Isn't that what we do right now? I'll have to run more tests. On Amazon Linux, we don't have networking in the local init so that it always fails. I assume that if we had networking and no IMDS endpoint, it would fail to find a datasource, move on to init and then, fail again to find a datasource. As long as we fail fast in both cases, we want to failover to a cached staged after the non-local init has failed.... when we know we are not going to find an alternate datasource and, either we fall back to a cached data source or we will move on anyway without a datasource at all.

fred-lefebvre · 2020-03-04T06:49:52Z

@raharper : I may be misunderstanding the relationship between local-init and init but here is why I think this is safe/correct:

If at any point we are dealing with a datasource where check_instance_id() is implemented and returns True, we will never exercise this patch.
If init-local finds a datasource, whether through the cache or find_source(), it will persist it to disc and replace the instance_link. init, setting existing=check will find the matching datasource persisted by init-local and load it without ever making it to find_source() [Here we will never make it to DataSourceNotFoundException and the fallback but, because of we switched existing=trust to existing=check, we make a second call to check_instance_id() ]
If init-local doesn't find a datasource, it will leave the instance_link alone (it would have previously deleted it), init will run find_source() and, only if that raises DataSourceNotFoundException, we will try to load a cached datasource without regard to check_instance_id(). [In this case, we again have the second call to check_instance_id() but this time we enter the new except block and we reuse a cached datasource if it still matches the config]

#3, above, is the only case where we have a significant change of behavior. Can we do better? We could increase coordination/communication between init-local and init but that's a fairly involved architectural change. We would not be dealing with 2 independent and almost stateless processes anymore. With the only tangible benefit being a faster path to an exceptional fallback, I don't think it's worth the effort and potential side-effects. The return would be better if we, instead, made sure that find_source() fails as fast as possible when no datasource is present (that work is specific to each datasource).

raharper · 2020-03-04T16:47:14Z

@raharper : I may be misunderstanding the relationship between local-init and init but here is why I think this is safe/correct:

If at any point we are dealing with a datasource where check_instance_id() is implemented and returns True, we will never exercise this patch.

Yes.

If init-local finds a datasource, whether through the cache or find_source(), it will persist it to disc and replace the instance_link. init, setting existing=check will find the matching datasource persisted by init-local and load it without ever making it to find_source() [Here we will never make it to DataSourceNotFoundException and the fallback but, because of we switched existing=trust to existing=check, we make a second call to check_instance_id() ]

Yes. However, in the case where we don't have check_instance_id() without providing a trust flag between local and init; the check in local is wasted. We cannot check only in local as some datasources are Network only, that is they never get "found" when DSMODE=local.

If init-local doesn't find a datasource, it will leave the instance_link alone (it would have previously deleted it), init will run find_source() and, only if that raises DataSourceNotFoundException, we will try to load a cached datasource without regard to check_instance_id(). [In this case, we again have the second call to check_instance_id() but this time we enter the new except block and we reuse a cached datasource if it still matches the config]

Yes

#3, above, is the only case where we have a significant change of behavior. Can we do better? We could increase coordination/communication between init-local and init but that's a fairly involved architectural change. We would not be dealing with 2 independent and almost stateless processes anymore. With the only tangible benefit being a faster path to an exceptional fallback, I don't think it's worth the effort and potential side-effects. The return would be better if we, instead, made sure that find_source() fails as fast as possible when no datasource is present (that work is specific to each datasource).

Thanks for the detailed walk-through. I'm in agreement with your assessment and I'm itching to fix up the duplication of fetch and the lack of coordination between local and net mode datasources. Given that cloud-init (typically) enables only the datasource it has detected, and we can know if it's local-only, network-only, or both; we do know when we can check if the datasource is viable or not and do that only once. However, as you point out, that's a larger change and this PR is an improvement over existing behavior.

raharper

Thanks for working on this. I've left a question on how much or little we should log in the scenarios where we don't match the cached datasource, looking for your thoughts there. I'd like to see a unittest exercising this path. We can help facilitate that.

raharper · 2020-03-04T17:15:01Z

+                else:
+                    LOG.info("Cached datasource '%s' doesn't match "
+                             "datasources enabled in the configuration",
+                             ds.dsname)


Couple of thoughts here. On modern instance which use ds-identify to detect the datasources, this ds list is just two elements, the detected DS and None; However, on Xenial, for example, the list_sources() will be longer. The else clause will only print out the last ds name it checked, so I'm wondering if it would be useful to users (or devs) who debug the scenario where the cached datasource doesn't match, to include the list of ds names that were checked.

Second though, in the case that the name matches, but the instance-id's don't match; we only see that the last ds didn't match the cached datasource; I would think we'd want to indicate that we didn't use the cached datasource due to previous_iid() not matching restored_iid.

I'm wondering if it would be useful to users (or devs) who debug the scenario where the cached datasource doesn't match, to include the list of ds names that were checked.

Yes. I agree.

Second though, in the case that the name matches, but the instance-id's don't match; we only see that the last ds didn't match the cached datasource; I would think we'd want to indicate that we didn't use the cached datasource due to previous_iid() not matching restored_iid.

In insight, I'm not sure this check is even useful. Because we are not checking against the current instance ID (that would be check_instance_id()'s job). So... is there really any case where the instance id in the pickled DataSource object would not match the instance id stored in previous-instance-id ? After all, if we move the volume to a new instance, we bring along both the pickled object and /var/lib/cloud/data.

raharper · 2020-03-10T16:09:33Z

Hi Fred,

sorry for not updating this last week. @blackboxsw and I had a discussion about this change, and in particular, the burden of hitting IMDS twice when we're not in the fallback scenario is too heavy. On some platforms, crawling IMDS may be tens of seconds, and re-crawling at DSMODE=net due to the existing=check change means we'll recrawl all of that a second time for no gain.

As such, we'd like to step back a bit and see how we can retain the existing=trust scenario so that if local does find a datasource and it can get data, then at net time we do not want or need to recrawl.

We've started a document to capture some of the discussion:

https://hackmd.io/73cx4VCJQDuAdEaOu1xSGQ

raharper · 2020-03-23T23:00:09Z

Fred,

Thanks for being patient with us. I updated the hackmd document with more of a matrix on datasources, which are local or net, or both and whether we have a check_instance_id or not. With all of that documented, my reading of that information points to me that only Ec2 datasource runs at local and net times and also does NOT implement check_instance_id(); which means Ec2 is the only case where we'd double-crawl IMDS.

To be certain, I packaged up this branch and ran this on OpenStack and Ec2. On OpenStack, I can confirm we crawl at local time, persist the object, at net we check if we can restore from cache, we do, and then do not crawl IMDS a second time, no change in total boot time from a cloud-init perspective.

On Ec2, I tested this and even though Ec2 does not implement the check_instance_id, because cloud-init persist the object at local time, we only run the restore_from_checked_cache twice, which, as you said, is not much work at all.

I'm becoming more comfortable with this approach. I'll update my review here tomorrow with some feedback/changes; around logging and unittests.

blackboxsw · 2020-03-06T04:11:24Z

+                if not ds:
+                    raise e
+                dsname = ds.dsname
+                restored_iid = ds.metadata.get('instance-id', None)


Why aren't we using ds.get_instance_id() here?

raharper · 2020-03-24T19:21:55Z

+                # Could not find a valid datasource.
+                if fallback:
+                    LOG.info("No datasource found - "
+                             "Trying to Load last one from cache")


Suggested change

"Trying to Load last one from cache")

"Trying to load previous datasource from cache")

raharper · 2020-03-24T22:26:03Z

    try:
-        init.fetch(existing=existing)
+        init.fetch(existing=existing,
+                   fallback=(mode == sources.DSMODE_NETWORK))


This currently restricts fallback to only NETWORK datasources. I believe OVF would like this set to fallback=True; which would allow a local-time datasource to fallback to the previous datasource.

Testing this out, I simulated a dead IMDS by dropping packets heading there:

sudo iptables -A OUTPUT -d 169.254.169.254 -j DROP

Then I ran

cloud-init --debug init --local cloud-init --debug init

The previous object was restored from cache at each stage.

raharper · 2020-03-24T22:29:34Z

            LOG.debug(myrep.description)

        if not ds:
-            util.del_file(self.paths.instance_link)


In my testing on ec2 with this branch, the 'check-cache' always finds the instance obj.plk, which AFAICT, means that we never go down the DataSourceNotFound.

The only time I could force this was if the obj.pkl did not exist, then find_source() would call update_metadata() which calls get_data(). If IMDS is down, then we return False and do not find any datsources. This triggers the DataSourceNotFoundException path.

Down that path, it then calls self._restore_from_checked_cache() which will fail the same way (no obj.plk) so, I cannot see how the rest of the exception handler is doing anything. What am I missing?

github-actions · 2020-04-08T00:04:21Z

Hello! Thank you for this proposed change to cloud-init. This pull request is now marked as stale as it has not seen any activity in 14 days. If no activity occurs within the next 7 days, this pull request will automatically close.

If you are waiting for code review and you are seeing this message, apologies! Please reply, tagging powersj, and he will ensure that someone takes a look soon.

(If the pull request is closed, please do feel free to reopen it if you wish to continue working on it.)

raharper suggested changes Mar 3, 2020

View reviewed changes

fred-lefebvre mentioned this pull request Mar 3, 2020

ec2: Do not fallback to IMDSv1 on EC2 #216

Merged

fred-lefebvre requested a review from raharper March 3, 2020 23:46

raharper suggested changes Mar 4, 2020

View reviewed changes

raharper mentioned this pull request Mar 4, 2020

OVF: Prevent cloud-init applying default network configuration when t… #56

Closed

blackboxsw reviewed Mar 24, 2020

View reviewed changes

OddBloke assigned raharper Mar 24, 2020

raharper suggested changes Mar 24, 2020

View reviewed changes

github-actions Bot added the stale-pr Pull request is stale; will be auto-closed soon label Apr 8, 2020

github-actions Bot closed this Apr 15, 2020

raharper mentioned this pull request Nov 2, 2020

Fallback to cached datasouce when no valid DS #647

Closed

This was referenced May 11, 2023

OVF datasource should check if instant id is still on VMware Platform #3402

Closed

Cloudinit resets network scripts to default configuration DHCP once Config Drive is removed after sometime #3777

Open

	"Trying to Load last one from cache")
	"Trying to load previous datasource from cache")

Conversation

fred-lefebvre commented Feb 29, 2020

Uh oh!

raharper left a comment

Choose a reason for hiding this comment

Uh oh!

fred-lefebvre commented Mar 3, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

fred-lefebvre commented Mar 4, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

raharper commented Mar 4, 2020

Uh oh!

raharper left a comment

Choose a reason for hiding this comment

Uh oh!

raharper Mar 4, 2020

Choose a reason for hiding this comment

Uh oh!

fred-lefebvre Mar 4, 2020

Choose a reason for hiding this comment

Uh oh!

raharper commented Mar 10, 2020

Uh oh!

raharper commented Mar 23, 2020

Uh oh!

blackboxsw Mar 6, 2020

Choose a reason for hiding this comment

Uh oh!

raharper Mar 24, 2020

Choose a reason for hiding this comment

Uh oh!

raharper Mar 24, 2020

Choose a reason for hiding this comment

Uh oh!

raharper Mar 24, 2020

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented Apr 8, 2020

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

fred-lefebvre commented Mar 3, 2020 •

edited

Loading

fred-lefebvre commented Mar 4, 2020 •

edited

Loading