Azure: During primary nic detection, check interface status continuously before rebinding again #990
blackboxsw
left a comment
Content and retries look good. I guess I'd like some sort of debug log breadcrumb that we are polling so we know where we are spending time.
Co-authored-by: Chad Smith <chad.smith@canonical.com>
sleep_duration = 0.5
max_status_polls = 20
LOG.debug("Polling %d seconds for primary NIC link up after rebind.", sleep_duration * max_status_polls)
for i in range(0, max_status_polls):
Thanks @aswinrajamannar for the quick fix. Given that this polling on networking.is_up seems to be generally helpful, do you think it would make sense to wrap self.distro.networking.is_up and the polling logic in a simple local function that we can also call above, before the while True loop, on the initial self.distro.networking.try_set_link_up check?
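A minimal sketch of the kind of local helper suggested here, assuming the link check can be passed in as a zero-argument callable. `poll_until` and its parameters are hypothetical names for illustration, not part of cloud-init; the defaults mirror the 0.5-second sleep and 20 polls used in this PR.

```python
import time
from typing import Callable, Optional


def poll_until(
    check: Callable[[], bool],
    max_polls: int = 20,
    sleep_duration: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[int]:
    """Call check() up to max_polls times, sleeping between failed checks.

    Returns the number of failed checks that preceded the first success,
    or None if check() never returned True. This is a hypothetical helper,
    not an existing cloud-init function.
    """
    for i in range(max_polls):
        if check():
            return i
        sleep(sleep_duration)
    return None
```

In the data source, the call site might look like `poll_until(lambda: self.distro.networking.is_up(ifname))`. The `sleep` parameter is injectable only so that the helper can be unit-tested without real delays.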
But maybe this retry logic doesn't really pay off on the first call to self.distro.networking.try_set_link_up in wait_for_link_up and only adds 10 seconds of delay to the whole process, because we expect wait_for_link_up to succeed immediately. If the first link-up attempt fails, then we do a deeper unbind/bind loop with retries.
The polling is not needed before the loop. The polling is necessary here mainly because we wrote to hv_netvsc/unbind and hv_netvsc/bind, which forces the driver to refresh the state of the link. After the unbind and bind, it seems to take a few seconds for the link to actually be up, which is why we wait for is_up before calling unbind and bind again.
In most cases, the link will be up after just one call to try_set_link_up, so we don't need to waste time doing unbind and bind. That is why that call is made before the while loop. If that first check doesn't bring the link up, simply waiting longer without doing unbind and bind is not going to help and would be an unnecessary delay. Hope this clarifies things a bit. Let me know if I misunderstood your comment.
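The control flow described in this reply can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in DataSourceAzure.py: the link operations are passed in as plain callables, and `max_rebinds` is a hypothetical bound added only so the sketch can terminate (the comment above describes no such limit).

```python
from typing import Callable, Optional


def wait_for_link_up_sketch(
    try_set_link_up: Callable[[], bool],
    rebind: Callable[[], None],
    is_up: Callable[[], bool],
    sleep: Callable[[float], None],
    sleep_duration: float = 0.5,
    max_status_polls: int = 20,
    max_rebinds: Optional[int] = None,  # hypothetical; None means no limit
) -> int:
    """Return how many rebinds were needed before the link came up."""
    # Fast path: one attempt before the loop, with no polling, because
    # waiting longer without a rebind is not expected to help.
    if try_set_link_up():
        return 0

    attempts = 0
    while max_rebinds is None or attempts < max_rebinds:
        # Stand-in for writing the device id to hv_netvsc/unbind and bind.
        rebind()
        attempts += 1
        if try_set_link_up():
            return attempts
        # The link can take a few seconds to come up after a rebind, so
        # poll its status before forcing another rebind.
        for _ in range(max_status_polls):
            if is_up():
                return attempts
            sleep(sleep_duration)
    raise TimeoutError("link still down after %d rebinds" % attempts)
```

With the defaults, each rebind is followed by up to 20 status checks spaced 0.5 seconds apart, which matches the roughly 10 seconds of polling mentioned in the debug log message.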
+1, this general polling looks like it could be reasonable for more general use cases down in https://github.com/canonical/cloud-init/blob/main/cloudinit/distros/networking.py#L226 if we were to add an optional "retries" parameter or something similar that, if set, would retry the networking.is_up check that many times. I don't think this needs to be fixed in this branch, but it might be something we request in a follow-up PR.
blackboxsw
left a comment
Thanks @aswinrajamannar. Things look pretty good as far as logic layout. I have a couple of minor nits that you can feel free to disagree with.
- Can we tune down the general log level from LOG.info to LOG.debug for things that are not important for the typical person to consume or react to? General informational messages that don't require a human reaction can generally be classified by cloud-init as debug messages. Things like intended configuration breaking or being ignored can qualify as LOG.info messages.
- Also along these lines of too much logging noise, do we really want to LOG.debug every single retry of the "Link is not up after %d attempts to rebind" message? It seems we have bookend logs that will do the trick, telling us that we are starting to unbind/rebind and the number of attempts until success.
Here's a diff of what I was thinking for your review either way
diff --git a/cloudinit/sources/DataSourceAzure.py b/cloudinit/sources/DataSourceAzure.py
index bbd40617..d7762e23 100755
--- a/cloudinit/sources/DataSourceAzure.py
+++ b/cloudinit/sources/DataSourceAzure.py
@@ -892,12 +892,12 @@ class DataSourceAzure(sources.DataSource):
logger_func=LOG.info)
return
- LOG.info("Attempting to bring %s up", ifname)
+ LOG.debug("Attempting to bring %s up", ifname)
attempts = 0
+ LOG.debug("Unbinding and binding the interface %s", ifname)
while True:
- LOG.info("Unbinding and binding the interface %s", ifname)
devicename = net.read_sys_net(ifname,
'device/device_id').strip('{}')
util.write_file('/sys/bus/vmbus/drivers/hv_netvsc/unbind',
@@ -912,12 +912,10 @@ class DataSourceAzure(sources.DataSource):
report_diagnostic_event(msg, logger_func=LOG.info)
return
- msg = ("Link is not up after %d attempts to rebind" % attempts)
-
if attempts % 10 == 0:
+ msg = ("Link is not up after %d attempts to rebind" % attempts)
report_diagnostic_event(msg, logger_func=LOG.info)
- else:
- LOG.info(msg)
+ LOG.debug(msg)
# It could take some time after rebind for the interface to be up.
# So poll for the status for some time before attempting to rebind
@@ -929,6 +927,7 @@ class DataSourceAzure(sources.DataSource):
msg = ("After %d attempts to rebind, link is up after "
"polling the link status %d times" % (attempts, i))
report_diagnostic_event(msg, logger_func=LOG.info)
+ LOG.debug(msg)
return
else:
sleep(sleep_duration)
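One reading of the logging change in the diff above, sketched in isolation: the "not up" message is built and reported only on every tenth rebind attempt, and the per-attempt LOG.info call is dropped. `note_rebind_failure` is a hypothetical name, and `report_diagnostic_event` here is a local stand-in with a simplified signature, not the cloud-init helper.

```python
import logging

LOG = logging.getLogger("rebind_sketch")


def report_diagnostic_event(msg, logger_func=LOG.debug):
    # Stand-in for cloud-init's diagnostic reporting; here it only logs.
    logger_func(msg)


def note_rebind_failure(attempts: int, report_every: int = 10) -> bool:
    """Report a failed rebind only on every report_every-th attempt.

    Returns True when a message was emitted, False when the attempt was
    skipped to keep log noise down.
    """
    if attempts % report_every == 0:
        msg = "Link is not up after %d attempts to rebind" % attempts
        report_diagnostic_event(msg, logger_func=LOG.info)
        return True
    return False
```

The "Unbinding and binding the interface" message before the loop and the success message after it then act as the bookend logs mentioned in the review.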
Also please run |
blackboxsw
left a comment
LGTM! Thanks @aswinrajamannar. I'll land this after a CI pass. I validated the typical deploy case to confirm that we don't degrade expected behavior. Most of the backplane hot-add and wait-for-link-up logic isn't something I think I can instrument or test without internal access to Azure's backplane, but the logic change is simple enough that it shouldn't cause any hiccups.
Proposed Commit Message
Azure: During primary nic detection, check interface status with sleep before rebinding again
Summary:
This is a follow-up to #972. After A/B testing in Azure and getting input from networking experts, we realized the best way to detect whether the NIC is up after a netvsc unbind/bind is to check continuously for a while instead of rebinding immediately after 1 second. This change ensures that after a rebind, we check the link status every 500 ms for 10 s before trying to rebind again. Validations in Azure confirm that this is the best way to bring the link up quickly, rather than continuously rebinding in a loop, which could result in failures.