Fallback to cached datasource if no valid datasource#229
Conversation
If no valid datasource was found after the search for network datasources, we assume our datasource has gone away and try to use the last cached datasource if it hasn't been removed from the configuration. An example where this is useful is when the EC2 IMDS endpoint gets disabled after the 1st boot.
raharper
left a comment
There was a problem hiding this comment.
Thanks for putting this PR up. I need to double check, but I do have some concerns about DSMODE_NETWORK only for this change. Ec2 (and others) run both in DSMODE_LOCAL and DSMODE_NETWORK. At local time, we will attempt to bring up networking to fetch IMDS data (which includes networking configuration). In your scenario where IMDS is not availble, it's not available in either LOCAL or NETWORK mode of Ec2 datasource; and cloud-init-local.service is going to take the sources.DSMODE_LOCAL path first, purge_cache() which is going to remove the instance_link; and then run an init.fetch();
In general, I like the idea of falling back on the existing datasource; there are other PRs in-flight looking to do something similiar (in VMWare OVF) so there's clearly a need for a general solution (though we can defer generalizing if it's easier to land DS specific changes).
I'd like to think a bit more on:
- For datasources which do run at local time, can we determine that we're still on the previous datasource, but _get_data() has failed (ie, imds is down, or whatever config put isn't available , OVF doesn't have a custom config file.
- If we can detect (1); would the situation change if we're DSMODE_NETWORK ?; we may need to defer till a later stage (cloud-init.service, vs, cloud-init-local.service) to decide that we cannot have a successful _get_data()
- how do we communicate from local stage, to init stage that we've already determined that the DS is not available so that we don't repeat the checks we did?
Isn't that what we do right now? I'll have to run more tests. On Amazon Linux, we don't have networking in the local init so that it always fails. I assume that if we had networking and no IMDS endpoint, it would fail to find a datasource, move on to init and then, fail again to find a datasource. As long as we fail fast in both cases, we want to failover to a cached staged after the non-local init has failed.... when we know we are not going to find an alternate datasource and, either we fall back to a cached data source or we will move on anyway without a datasource at all. |
|
@raharper : I may be misunderstanding the relationship between local-init and init but here is why I think this is safe/correct:
#3, above, is the only case where we have a significant change of behavior. Can we do better? We could increase coordination/communication between init-local and init but that's a fairly involved architectural change. We would not be dealing with 2 independent and almost stateless processes anymore. With the only tangible benefit being a faster path to an exceptional fallback, I don't think it's worth the effort and potential side-effects. The return would be better if we, instead, made sure that |
Yes.
Yes. However, in the case where we don't have check_instance_id() without providing a trust flag between local and init; the check in local is wasted. We cannot check only in local as some datasources are Network only, that is they never get "found" when DSMODE=local.
Yes
Thanks for the detailed walk-through. I'm in agreement with your assessment and I'm itching to fix up the duplication of fetch and the lack of coordination between local and net mode datasources. Given that cloud-init (typically) enables only the datasource it has detected, and we can know if it's local-only, network-only, or both; we do know when we can check if the datasource is viable or not and do that only once. However, as you point out, that's a larger change and this PR is an improvement over existing behavior. |
raharper
left a comment
There was a problem hiding this comment.
Thanks for working on this. I've left a question on how much or little we should log in the scenarios where we don't match the cached datasource, looking for your thoughts there. I'd like to see a unittest exercising this path. We can help facilitate that.
| else: | ||
| LOG.info("Cached datasource '%s' doesn't match " | ||
| "datasources enabled in the configuration", | ||
| ds.dsname) |
There was a problem hiding this comment.
Couple of thoughts here. On modern instance which use ds-identify to detect the datasources, this ds list is just two elements, the detected DS and None; However, on Xenial, for example, the list_sources() will be longer. The else clause will only print out the last ds name it checked, so I'm wondering if it would be useful to users (or devs) who debug the scenario where the cached datasource doesn't match, to include the list of ds names that were checked.
Second though, in the case that the name matches, but the instance-id's don't match; we only see that the last ds didn't match the cached datasource; I would think we'd want to indicate that we didn't use the cached datasource due to previous_iid() not matching restored_iid.
There was a problem hiding this comment.
I'm wondering if it would be useful to users (or devs) who debug the scenario where the cached datasource doesn't match, to include the list of ds names that were checked.
Yes. I agree.
Second though, in the case that the name matches, but the instance-id's don't match; we only see that the last ds didn't match the cached datasource; I would think we'd want to indicate that we didn't use the cached datasource due to previous_iid() not matching restored_iid.
In insight, I'm not sure this check is even useful. Because we are not checking against the current instance ID (that would be check_instance_id()'s job). So... is there really any case where the instance id in the pickled DataSource object would not match the instance id stored in previous-instance-id ? After all, if we move the volume to a new instance, we bring along both the pickled object and /var/lib/cloud/data.
|
Hi Fred, sorry for not updating this last week. @blackboxsw and I had a discussion about this change, and in particular, the burden of hitting IMDS twice when we're not in the fallback scenario is too heavy. On some platforms, crawling IMDS may be tens of seconds, and re-crawling at DSMODE=net due to the existing=check change means we'll recrawl all of that a second time for no gain. As such, we'd like to step back a bit and see how we can retain the existing=trust scenario so that if local does find a datasource and it can get data, then at net time we do not want or need to recrawl. We've started a document to capture some of the discussion: |
|
Fred, Thanks for being patient with us. I updated the hackmd document with more of a matrix on datasources, which are local or net, or both and whether we have a check_instance_id or not. With all of that documented, my reading of that information points to me that only Ec2 datasource runs at local and net times and also does NOT implement check_instance_id(); which means Ec2 is the only case where we'd double-crawl IMDS. To be certain, I packaged up this branch and ran this on OpenStack and Ec2. On OpenStack, I can confirm we crawl at local time, persist the object, at net we check if we can restore from cache, we do, and then do not crawl IMDS a second time, no change in total boot time from a cloud-init perspective. On Ec2, I tested this and even though Ec2 does not implement the check_instance_id, because cloud-init persist the object at local time, we only run the restore_from_checked_cache twice, which, as you said, is not much work at all. I'm becoming more comfortable with this approach. I'll update my review here tomorrow with some feedback/changes; around logging and unittests. |
| if not ds: | ||
| raise e | ||
| dsname = ds.dsname | ||
| restored_iid = ds.metadata.get('instance-id', None) |
There was a problem hiding this comment.
Why aren't we using ds.get_instance_id() here?
| # Could not find a valid datasource. | ||
| if fallback: | ||
| LOG.info("No datasource found - " | ||
| "Trying to Load last one from cache") |
There was a problem hiding this comment.
| "Trying to Load last one from cache") | |
| "Trying to load previous datasource from cache") |
| try: | ||
| init.fetch(existing=existing) | ||
| init.fetch(existing=existing, | ||
| fallback=(mode == sources.DSMODE_NETWORK)) |
There was a problem hiding this comment.
This currently restricts fallback to only NETWORK datasources. I believe OVF would like this set to fallback=True; which would allow a local-time datasource to fallback to the previous datasource.
Testing this out, I simulated a dead IMDS by dropping packets heading there:
sudo iptables -A OUTPUT -d 169.254.169.254 -j DROP
Then I ran
cloud-init --debug init --local
cloud-init --debug init
The previous object was restored from cache at each stage.
| LOG.debug(myrep.description) | ||
|
|
||
| if not ds: | ||
| util.del_file(self.paths.instance_link) |
There was a problem hiding this comment.
In my testing on ec2 with this branch, the 'check-cache' always finds the instance obj.plk, which AFAICT, means that we never go down the DataSourceNotFound.
The only time I could force this was if the obj.pkl did not exist, then find_source() would call update_metadata() which calls get_data(). If IMDS is down, then we return False and do not find any datsources. This triggers the DataSourceNotFoundException path.
Down that path, it then calls self._restore_from_checked_cache() which will fail the same way (no obj.plk) so, I cannot see how the rest of the exception handler is doing anything. What am I missing?
|
Hello! Thank you for this proposed change to cloud-init. This pull request is now marked as stale as it has not seen any activity in 14 days. If no activity occurs within the next 7 days, this pull request will automatically close. If you are waiting for code review and you are seeing this message, apologies! Please reply, tagging powersj, and he will ensure that someone takes a look soon. (If the pull request is closed, please do feel free to reopen it if you wish to continue working on it.) |
commit message:
some notes about this change:
instance_link(/var/lib/cloud/instance) so that remains after we have failed to find a working network datasourceexisting=check, so that it will only reuse a cache that would have been left behind without the change in Add a spec draft for the new distro namespace. #1.find_source()has failed in non-local mode, we add a fallback boolean tofetch()that is only set to true ifmode == sources.DSMODE_NETWORKexcept sources.DataSourceNotFoundExceptionblock to a function but it seemed easier to understand inside the context of_get_data_source(). I also considered modifying_restore_from_checked_cache()to handle this use case and just calling it again inside theexceptblock. Thoughts?fallback = (mode == sources.DSMODE_NETWORK)if a new config option was set to minimise the blast radius of this change. But in the end, I felt the change made sense as a default. I'm happy to reconsider.