Azure parse_network_config uses fallback cfg when generate IMDS network cfg fails #549
…lure to generate network config from IMDS
@anhvoms @Moustafa-Moustafa @trstringer This prevents invalid/corrupted IMDS network metadata from causing provisioning as a whole to fail.
@johnsonshi thanks for this PR. Could you attach, as a comment, the output of cloud-init query --all? (I'm specifically interested in the network section as surfaced by IMDS on one of these failed nodes.) If reproducible
blackboxsw
left a comment
One general question: if IMDS is in a state that cloud-init can't parse, is the IMDS network config content recoverable? As in, would retries buy us anything in these cases?
I'm afraid not. The IMDS inconsistencies and delays aren't deterministic, and it's impossible to determine whether the network metadata returned is complete. The root cause of these inconsistencies is a platform issue.
We discovered these by looking at the
blackboxsw
left a comment
Thanks @johnsonshi, and sorry for the delay in responding. I think we can drop that UT you mentioned. Please also update the pull request description to note the additional mlx5_core driver blacklist functionality in the fallback config.
We'll be using the pull request description as your squashed merge commit for this PR when this lands.
Apologies, I've got a couple of high-priority items, so this has been delayed.
@blackboxsw I've updated the PR description + added the comments on the VM instance SKUs with
blackboxsw
left a comment
Excellent @johnsonshi, thanks for the additions. I've launched an instance, upgraded it to this version of cloud-init, and see correct network configuration emitted and the network properly set up on the non-infiniband eth0 device.
ubuntu@SRU-worked-azure:~$ grep Trace /var/log/cloud-init.log
ubuntu@SRU-worked-azure:~$ cloud-init status --long
status: done
time: Thu, 24 Sep 2020 16:43:35 +0000
detail:
DataSourceAzure [seed=/var/lib/waagent]
ubuntu@SRU-worked-azure:~$ cat /etc/netplan/50-cloud-init.yaml
# This file is generated from information provided by the datasource. Changes
# to it will not persist across an instance reboot. To disable cloud-init's
# network configuration capabilities, write a file
# /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg with the following:
# network: {config: disabled}
network:
    ethernets:
        eth0:
            dhcp4: true
            dhcp4-overrides: &id001
                route-metric: 100
            dhcp6: true
            dhcp6-overrides: *id001
            match:
                driver: hv_netvsc
                macaddress: 00:0d:3a:e2:d9:0e
            set-name: eth0
        eth1:
            dhcp4: true
            dhcp4-overrides: &id002
                route-metric: 200
            dhcp6: true
            dhcp6-overrides: *id002
            match:
                driver: hv_netvsc
                macaddress: 00:0d:3a:e2:de:17
            set-name: eth1
    version: 2
ubuntu@SRU-worked-azure:~$ ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:0d:3a:e2:d9:0e brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.4/24 brd 10.0.0.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 ace:cab:deca:deed::4/128 scope global dynamic noprefixroute
       valid_lft 17279947sec preferred_lft 8639947sec
    inet6 fe80::20d:3aff:fee2:d90e/64 scope link
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:0d:3a:e2:de:17 brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.5/24 brd 10.0.0.255 scope global eth1
       valid_lft forever preferred_lft forever
    inet6 fe80::20d:3aff:fee2:de17/64 scope link
       valid_lft forever preferred_lft forever
4: rename4: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master eth0 state UP group default qlen 1000
    link/ether 00:0d:3a:e2:d9:0e brd ff:ff:ff:ff:ff:ff
5: rename5: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master eth1 state UP group default qlen 1000
    link/ether 00:0d:3a:e2:de:17 brd ff:ff:ff:ff:ff:ff
Azure datasource's parse_network_config throws a fatal uncaught exception when an exception is raised during generation of network config from IMDS metadata. This happens when IMDS metadata is invalid/corrupted (such as when it is missing network or interface metadata). This causes the rest of provisioning to fail.

This changes parse_network_config to be a non-fatal implementation. Additionally, when generating network config from IMDS metadata fails, fall back on generating fallback network config (_generate_network_config_from_fallback_config).

This also changes fallback network config generation (_generate_network_config_from_fallback_config) to blacklist an additional driver: mlx5_core.
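
For readers skimming the description, the behavior change can be sketched roughly as follows. This is a minimal illustration and not the actual cloud-init code: the helper names `generate_from_imds` and `generate_from_fallback`, the simplified metadata shape, the `(name, driver)` NIC tuples, and the `mlx4_core` blacklist entry are all assumptions made for the example. Only `parse_network_config` and the `mlx5_core` driver name come from the PR.

```python
import logging

LOG = logging.getLogger(__name__)

# Drivers skipped when choosing a fallback NIC. mlx5_core is the driver this
# PR adds; the mlx4_core entry is an assumption for illustration.
BLACKLIST_DRIVERS = ["mlx4_core", "mlx5_core"]


def generate_from_imds(imds_metadata):
    """Hypothetical stand-in: build a v2 network config from IMDS metadata."""
    ethernets = {}
    # A KeyError here models missing "network" or "interface" metadata.
    for idx, intf in enumerate(imds_metadata["network"]["interface"]):
        name = "eth%d" % idx
        ethernets[name] = {
            "dhcp4": True,
            "match": {"macaddress": intf["macAddress"].lower()},
            "set-name": name,
        }
    return {"version": 2, "ethernets": ethernets}


def generate_from_fallback(nics, blacklist_drivers=BLACKLIST_DRIVERS):
    """Hypothetical stand-in: DHCP on the first NIC with a usable driver.

    `nics` is a list of (interface name, driver name) tuples.
    """
    for name, driver in nics:
        if driver not in blacklist_drivers:
            return {
                "version": 2,
                "ethernets": {
                    name: {"dhcp4": True, "match": {"driver": driver}}
                },
            }
    return None


def parse_network_config(imds_metadata, nics):
    """Prefer IMDS-derived config; never let a bad payload be fatal."""
    try:
        return generate_from_imds(imds_metadata)
    except Exception as e:
        # Log and continue instead of aborting the rest of provisioning.
        LOG.error("Failed generating network config from IMDS metadata: %s", e)
    return generate_from_fallback(nics)
```

With an empty metadata dict and NICs `[("ib0", "mlx5_core"), ("eth0", "hv_netvsc")]`, this sketch skips the blacklisted `mlx5_core` device and returns a DHCP config for `eth0` only.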