Skip to content

Azure: During primary nic detection, check interface status continuously before rebinding again#990

Merged
blackboxsw merged 4 commits into
canonical:mainfrom
aswinrajamannar:multinicsleepfix
Aug 20, 2021
Merged

Azure: During primary nic detection, check interface status continuously before rebinding again#990
blackboxsw merged 4 commits into
canonical:mainfrom
aswinrajamannar:multinicsleepfix

Conversation

@aswinrajamannar
Copy link
Copy Markdown
Contributor

@aswinrajamannar aswinrajamannar commented Aug 20, 2021

Proposed Commit Message

Azure: During primary nic detection, check interface status with sleep before rebinding again

Summary:

This is a follow up to #972. After A/B testing in Azure and getting inputs from networking experts, we realized the best way to detect if nic is up after netvsc unbind/bind is to check continuously for a while instead of rebinding immediately after 1 second. This change ensures that after a rebind, we check the link status every 500ms for 10s before trying rebind again. Validations in Azure confirm that this is the best way to bring the link up fast instead of continuously rebinding in a loop which could result in failures.

Test Steps

  • Deploy a savable PPS VM in Azure (repro is internal to Azure)
  • Reuse the VM with multiple NICs. Without this fix, a small number of the VMs could fail because cloudinit keeps rebinding the interface in attempts to bring the link up.

Checklist:

  • My code follows the process laid out in the documentation
  • I have updated or added any unit tests accordingly
  • I have updated or added any documentation accordingly

@aswinrajamannar aswinrajamannar changed the title Modify sleep logic in primary nic detection based on observation Azure: During primary nic detection, check interface status continuously before rebinding again Aug 20, 2021
Comment thread cloudinit/sources/DataSourceAzure.py
Copy link
Copy Markdown
Collaborator

@blackboxsw blackboxsw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Content and retries look good. I guess I'd like some sort of debug log breadcrumb that we are polling so we know where we are spending time.

Co-authored-by: Chad Smith <chad.smith@canonical.com>
sleep_duration = 0.5
max_status_polls = 20
LOG.debug("Polling %d seconds for primary NIC link up after rebind.", sleep_duration * max_status_polls)
for i in range(0, max_status_polls):
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @aswinrajamannar for the quick fix, given that this polling on networking.is_up seems to be generally helpful, do you think it would make sense to either define a local function to wrap self.distro-networking.is_up and the polling logic into a simple function that we can also call above before the while True loop on the initial self.distro.networking.try_set_link_up check?

Copy link
Copy Markdown
Collaborator

@blackboxsw blackboxsw Aug 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But, maybe this retry logic doesn't really payoff on the first call to self.distro.networking.try_set_link_up in wait_for_link_up and only adds 10 seconds of delay to the whole process because we expect wait_for_link_up to succeed immediately. If first linkup attempt fails, then we do a deeper unbind/bind loop with retries

Copy link
Copy Markdown
Contributor Author

@aswinrajamannar aswinrajamannar Aug 20, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The polling is not needed before the loop. The polling is necessary here mainly because we wrote to hv_netvsc/unbind and hv_netvsc/bind, which is forcing the driver to refresh the state of the link. After doing the unbind & bind, it seems to take a few seconds for the link to be actually up that's why we wait for is_up before calling unbind & bind again.

In most of the cases, the link would be up after just one call to try_set_link_up so we don't need to waste time doing unbind and bind. That is why it is added before the while loop. If that first check doesn't bring the link up, simply waiting longer without doing unbind and bind is not going to help and it would be an unnecessary delay. Hope this clarifies a little bit..? Or let me know if I misunderstood your comment.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 this general polling looks like it could be reasonable in for more general use-cases down in https://github.com/canonical/cloud-init/blob/main/cloudinit/distros/networking.py#L226 if we were to add an optional "retries" parameter or something that if set, we'd retry the check on networking.is_up for however many retries are provided. I don't think this needs to be fixed in this branch though, but it might be something we request on a followup PR.

Copy link
Copy Markdown
Collaborator

@blackboxsw blackboxsw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @aswinrajamannar. Things look pretty good as far as logic layout. I have a couple of minor nits that you can feel free to disagree with.

  1. Can we tune down the general log level from LOG.info to LOG.debug for things that are not important for the typical person to consume or react to? General informational messages that don't require human-reaction could generally be classified by cloud-init as debug messages. Things like inteded configuration breaking or being ignored can qualify as LOG.info messages .
  2. Also along this lines of too much logging noise, do we really want to LOG.debug every single retry of the " Link is not up after %d attempts to rebind" ? It seems we have bookend logs that will do the trick telling us that we are starting to unbind/rebind and the number or attempts until success.
    Here's a diff of what I was thinking for your review either way
diff --git a/cloudinit/sources/DataSourceAzure.py b/cloudinit/sources/DataSourceAzure.py
index bbd40617..d7762e23 100755
--- a/cloudinit/sources/DataSourceAzure.py
+++ b/cloudinit/sources/DataSourceAzure.py
@@ -892,12 +892,12 @@ class DataSourceAzure(sources.DataSource):
                                     logger_func=LOG.info)
             return
 
-        LOG.info("Attempting to bring %s up", ifname)
+        LOG.debug("Attempting to bring %s up", ifname)
 
         attempts = 0
+        LOG.debug("Unbinding and binding the interface %s", ifname)
         while True:
 
-            LOG.info("Unbinding and binding the interface %s", ifname)
             devicename = net.read_sys_net(ifname,
                                           'device/device_id').strip('{}')
             util.write_file('/sys/bus/vmbus/drivers/hv_netvsc/unbind',
@@ -912,12 +912,10 @@ class DataSourceAzure(sources.DataSource):
                 report_diagnostic_event(msg, logger_func=LOG.info)
                 return
 
-            msg = ("Link is not up after %d attempts to rebind" % attempts)
-
             if attempts % 10 == 0:
+                msg = ("Link is not up after %d attempts to rebind" % attempts)
                 report_diagnostic_event(msg, logger_func=LOG.info)
-            else:
-                LOG.info(msg)
+                LOG.debug(msg)
 
             # It could take some time after rebind for the interface to be up.
             # So poll for the status for some time before attempting to rebind
@@ -929,6 +927,7 @@ class DataSourceAzure(sources.DataSource):
                     msg = ("After %d attempts to rebind, link is up after "
                            "polling the link status %d times" % (attempts, i))
                     report_diagnostic_event(msg, logger_func=LOG.info)
+                    LOG.debug(msg)
                     return
                 else:
                     sleep(sleep_duration)

@blackboxsw
Copy link
Copy Markdown
Collaborator

Also please run tox -p auto or minimally tox -e flake8 as that first log line I pasted inline generated a flake error cloudinit/sources/DataSourceAzure.py:927:80: E501 line too long (116 > 79 characters)

Copy link
Copy Markdown
Collaborator

@blackboxsw blackboxsw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks @aswinrajamannar. I'll land this after a CI pass. I validated typical deploy case that we don't degrade expected behavior. Most of the backplane hot-add, wait for link up logic isn't something I think I can instrument test outside of internal access to Azure's backplane, but the logic change is simple enough it shouldn't cause any hiccups.

@blackboxsw blackboxsw merged commit 3ec8ddd into canonical:main Aug 20, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants