Azure: During primary nic detection, check interface status continuously before rebinding again by aswinrajamannar · Pull Request #990 · canonical/cloud-init

aswinrajamannar · 2021-08-20T19:30:22Z

Proposed Commit Message

Azure: During primary nic detection, check interface status with sleep before rebinding again

Summary:

This is a follow up to #972. After A/B testing in Azure and getting inputs from networking experts, we realized the best way to detect if nic is up after netvsc unbind/bind is to check continuously for a while instead of rebinding immediately after 1 second. This change ensures that after a rebind, we check the link status every 500ms for 10s before trying rebind again. Validations in Azure confirm that this is the best way to bring the link up fast instead of continuously rebinding in a loop which could result in failures.

Test Steps

Deploy a savable PPS VM in Azure (repro is internal to Azure)
Reuse the VM with multiple NICs. Without this fix, a small number of the VMs could fail because cloudinit keeps rebinding the interface in attempts to bring the link up.

Checklist:

My code follows the process laid out in the documentation
I have updated or added any unit tests accordingly
I have updated or added any documentation accordingly

blackboxsw

Content and retries look good. I guess I'd like some sort of debug log breadcrumb that we are polling so we know where we are spending time.

Co-authored-by: Chad Smith <chad.smith@canonical.com>

blackboxsw · 2021-08-20T20:52:42Z

+            sleep_duration = 0.5
+            max_status_polls = 20
+            LOG.debug("Polling %d seconds for primary NIC link up after rebind.", sleep_duration * max_status_polls)
+            for i in range(0, max_status_polls):


Thanks @aswinrajamannar for the quick fix, given that this polling on networking.is_up seems to be generally helpful, do you think it would make sense to either define a local function to wrap self.distro-networking.is_up and the polling logic into a simple function that we can also call above before the while True loop on the initial self.distro.networking.try_set_link_up check?

But, maybe this retry logic doesn't really payoff on the first call to self.distro.networking.try_set_link_up in wait_for_link_up and only adds 10 seconds of delay to the whole process because we expect wait_for_link_up to succeed immediately. If first linkup attempt fails, then we do a deeper unbind/bind loop with retries

The polling is not needed before the loop. The polling is necessary here mainly because we wrote to hv_netvsc/unbind and hv_netvsc/bind, which is forcing the driver to refresh the state of the link. After doing the unbind & bind, it seems to take a few seconds for the link to be actually up that's why we wait for is_up before calling unbind & bind again.

In most of the cases, the link would be up after just one call to try_set_link_up so we don't need to waste time doing unbind and bind. That is why it is added before the while loop. If that first check doesn't bring the link up, simply waiting longer without doing unbind and bind is not going to help and it would be an unnecessary delay. Hope this clarifies a little bit..? Or let me know if I misunderstood your comment.

+1 this general polling looks like it could be reasonable in for more general use-cases down in https://github.com/canonical/cloud-init/blob/main/cloudinit/distros/networking.py#L226 if we were to add an optional "retries" parameter or something that if set, we'd retry the check on networking.is_up for however many retries are provided. I don't think this needs to be fixed in this branch though, but it might be something we request on a followup PR.

blackboxsw

Thanks @aswinrajamannar. Things look pretty good as far as logic layout. I have a couple of minor nits that you can feel free to disagree with.

Can we tune down the general log level from LOG.info to LOG.debug for things that are not important for the typical person to consume or react to? General informational messages that don't require human-reaction could generally be classified by cloud-init as debug messages. Things like inteded configuration breaking or being ignored can qualify as LOG.info messages .
Also along this lines of too much logging noise, do we really want to LOG.debug every single retry of the " Link is not up after %d attempts to rebind" ? It seems we have bookend logs that will do the trick telling us that we are starting to unbind/rebind and the number or attempts until success.
Here's a diff of what I was thinking for your review either way

diff --git a/cloudinit/sources/DataSourceAzure.py b/cloudinit/sources/DataSourceAzure.py
index bbd40617..d7762e23 100755
--- a/cloudinit/sources/DataSourceAzure.py
+++ b/cloudinit/sources/DataSourceAzure.py
@@ -892,12 +892,12 @@ class DataSourceAzure(sources.DataSource):
                                     logger_func=LOG.info)
             return
 
-        LOG.info("Attempting to bring %s up", ifname)
+        LOG.debug("Attempting to bring %s up", ifname)
 
         attempts = 0
+        LOG.debug("Unbinding and binding the interface %s", ifname)
         while True:
 
-            LOG.info("Unbinding and binding the interface %s", ifname)
             devicename = net.read_sys_net(ifname,
                                           'device/device_id').strip('{}')
             util.write_file('/sys/bus/vmbus/drivers/hv_netvsc/unbind',
@@ -912,12 +912,10 @@ class DataSourceAzure(sources.DataSource):
                 report_diagnostic_event(msg, logger_func=LOG.info)
                 return
 
-            msg = ("Link is not up after %d attempts to rebind" % attempts)
-
             if attempts % 10 == 0:
+                msg = ("Link is not up after %d attempts to rebind" % attempts)
                 report_diagnostic_event(msg, logger_func=LOG.info)
-            else:
-                LOG.info(msg)
+                LOG.debug(msg)
 
             # It could take some time after rebind for the interface to be up.
             # So poll for the status for some time before attempting to rebind
@@ -929,6 +927,7 @@ class DataSourceAzure(sources.DataSource):
                     msg = ("After %d attempts to rebind, link is up after "
                            "polling the link status %d times" % (attempts, i))
                     report_diagnostic_event(msg, logger_func=LOG.info)
+                    LOG.debug(msg)
                     return
                 else:
                     sleep(sleep_duration)

blackboxsw · 2021-08-20T21:10:54Z

Also please run tox -p auto or minimally tox -e flake8 as that first log line I pasted inline generated a flake error cloudinit/sources/DataSourceAzure.py:927:80: E501 line too long (116 > 79 characters)

blackboxsw

LGTM! Thanks @aswinrajamannar. I'll land this after a CI pass. I validated typical deploy case that we don't degrade expected behavior. Most of the backplane hot-add, wait for link up logic isn't something I think I can instrument test outside of internal access to Azure's backplane, but the logic change is simple enough it shouldn't cause any hiccups.

aswinrajamannar changed the title ~~Modify sleep logic in primary nic detection based on observation~~ Azure: During primary nic detection, check interface status continuously before rebinding again Aug 20, 2021

Modify sleep logic in primary nic detection based on observation

b1cc7d9

aswinrajamannar force-pushed the multinicsleepfix branch from 365c3eb to b1cc7d9 Compare August 20, 2021 19:41

blackboxsw reviewed Aug 20, 2021

View reviewed changes

Comment thread cloudinit/sources/DataSourceAzure.py

blackboxsw reviewed Aug 20, 2021

View reviewed changes

Commit PR comment suggestion

bfe55aa

Co-authored-by: Chad Smith <chad.smith@canonical.com>

blackboxsw reviewed Aug 20, 2021

View reviewed changes

Aswin Rajamannar and others added 2 commits August 20, 2021 14:22

Address logging comments

a0074da

Merge branch 'main' into multinicsleepfix

96ff094

blackboxsw approved these changes Aug 20, 2021

View reviewed changes

blackboxsw merged commit 3ec8ddd into canonical:main Aug 20, 2021

This was referenced May 12, 2023

cloud-init in impish makes /home/ubuntu/.ssh root.root #3894

Closed

Release 21.3 #3898

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Azure: During primary nic detection, check interface status continuously before rebinding again#990

Azure: During primary nic detection, check interface status continuously before rebinding again#990
blackboxsw merged 4 commits into
canonical:mainfrom
aswinrajamannar:multinicsleepfix

aswinrajamannar commented Aug 20, 2021 •

edited

Loading

Uh oh!

Uh oh!

blackboxsw left a comment

Uh oh!

blackboxsw Aug 20, 2021

Uh oh!

blackboxsw Aug 20, 2021 •

edited

Loading

Uh oh!

aswinrajamannar Aug 20, 2021 •

edited

Loading

Uh oh!

blackboxsw Aug 20, 2021

Uh oh!

blackboxsw left a comment

Uh oh!

blackboxsw commented Aug 20, 2021

Uh oh!

blackboxsw left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

aswinrajamannar commented Aug 20, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Proposed Commit Message

Summary:

Test Steps

Checklist:

Uh oh!

Uh oh!

blackboxsw left a comment

Choose a reason for hiding this comment

Uh oh!

blackboxsw Aug 20, 2021

Choose a reason for hiding this comment

Uh oh!

blackboxsw Aug 20, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

aswinrajamannar Aug 20, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

blackboxsw Aug 20, 2021

Choose a reason for hiding this comment

Uh oh!

blackboxsw left a comment

Choose a reason for hiding this comment

Uh oh!

blackboxsw commented Aug 20, 2021

Uh oh!

blackboxsw left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

aswinrajamannar commented Aug 20, 2021 •

edited

Loading

blackboxsw Aug 20, 2021 •

edited

Loading

aswinrajamannar Aug 20, 2021 •

edited

Loading