This repository was archived by the owner on Oct 24, 2023. It is now read-only.

fix: working systemd monitor jobs #3788

Merged
jackfrancis merged 21 commits into Azure:master from jackfrancis:systemd-monitor on Sep 14, 2020

Conversation

@jackfrancis (Member):

Reason for Change:

This PR updates the implementation of the various systemd monitor jobs so that the following critical services are monitored for failure, and restarted (a generic sketch of the monitor pattern follows the list below):

  • docker
  • containerd
  • etcd
  • kubelet
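
For readers new to the pattern, here is a minimal sketch of the general shape such a monitor takes (illustrative names and polling interval only; the PR's actual scripts and units differ per service):

#!/bin/bash
# Generic systemd-driven service monitor: probe a service, restart on failure.
# A real deployment runs this from a systemd unit (e.g. Restart=always),
# not as a bare loop.
SERVICE="${1:?usage: monitor.sh <service>}"  # e.g. docker, containerd, etcd, kubelet

while true; do
  if ! systemctl is-active --quiet "${SERVICE}"; then
    echo "${SERVICE} is not active; restarting" >&2
    systemctl restart "${SERVICE}"
  fi
  sleep 10
done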



codecov bot commented Sep 4, 2020

Codecov Report

Merging #3788 into master will increase coverage by 0.00%.
The diff coverage is 93.75%.


@@           Coverage Diff           @@
##           master    #3788   +/-   ##
=======================================
  Coverage   73.19%   73.20%           
=======================================
  Files         148      148           
  Lines       25394    25403    +9     
=======================================
+ Hits        18587    18596    +9     
  Misses       5671     5671           
  Partials     1136     1136           
Impacted Files Coverage Δ
pkg/engine/templates_generated.go 53.42% <77.77%> (ø)
pkg/api/defaults-kubelet.go 96.82% <100.00%> (+0.01%) ⬆️
pkg/engine/armvariables.go 86.47% <100.00%> (-0.03%) ⬇️
pkg/engine/template_generator.go 82.40% <100.00%> (+0.20%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@jackfrancis force-pushed the systemd-monitor branch 2 times, most recently from 089b5ab to 21098b8, on September 4, 2020 23:29
{
  "name": "kubernetes-dashboard",
-  "enabled": true
+  "enabled": false
@jackfrancis (Member, Author):

these are temporary changes while we work on reducing customData size

NODE_INDEX=$(hostname | tail -c 2)
NODE_NAME=$(hostname)
- PRIVATE_IP=$(hostname -I | cut -d' ' -f1)
+ PRIVATE_IP=$(hostname -i | cut -d' ' -f1)
@jackfrancis (Member, Author):

It's unclear to me why we're using -I (get me all interfaces) vs. -i (get me my primary interface)...

Collaborator:

Note that the IP address here (-i) is a resolved IP address based on the host name. Is that what you want?

hostname -I returns all IP addresses without DNS resolution, which means you can get things like loopback, etc.
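
For illustration, the difference is easy to see on a typical host (the output values below are hypothetical):

# All locally configured addresses, no DNS lookup involved:
hostname -I    # e.g. "10.240.0.4 172.17.0.1" (primary NIC plus docker0, etc.)

# The address the host's own name resolves to (resolver/DNS dependent):
hostname -i    # e.g. "10.240.0.4", or a surprise like "127.0.1.1" from /etc/hosts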

Collaborator:

Note that I would look into another way to get the right IP address for the master (like, it must be known somewhere already)

@@ -1,7 +1,7 @@
[Unit]
@jackfrancis (Member, Author):

I would advocate we eliminate this timer spec. It adds complexity to the overall implementation, and the way we're implementing the docker health check (docker ps) should not be "racy" given that the health check script runs only after the docker systemd service has started (see After=docker.service above)

@Michael-Sinz @mboersma thoughts?

@jackfrancis (Member, Author):

I think what we're implicitly saying by delaying things for 30 mins is that we are tolerant of docker ps repeatedly failing during the first 30 mins of boot time, which I don't think is defensible.

Member:

I think providing a warmup period for Docker to get its act together was basically what Azure/acs-engine#4050 was about. I agree simpler is way better when it comes to systemd units especially, so I'm OK with removing it to see if it's unneeded now.

@jackfrancis (Member, Author):

After further investigation, it is a bit tricky to tell a systemd service "wait until this service starts, and also until it is fully activated"; so I've moved the delay into the health script itself. Arguably that is less complicated than maintaining systemd overhead to do the same thing.
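
Something like the following captures the idea; this is a minimal sketch, not the PR's exact script (the 30-minute window and the restart action are assumptions carried over from the earlier timer discussion):

#!/bin/bash
# Skip health enforcement during an initial warmup window, rather than
# delaying the entire systemd unit with a timer.
WARMUP_SECONDS=1800  # assumed warmup, mirroring the old 30-minute delay

# Seconds since boot: first field of /proc/uptime, truncated to an integer.
uptime_seconds=$(cut -d' ' -f1 /proc/uptime | cut -d'.' -f1)

if (( uptime_seconds < WARMUP_SECONDS )); then
  echo "within warmup window; skipping docker health check"
  exit 0
fi

# After warmup, a failing docker ps means the daemon is unhealthy.
docker ps >/dev/null 2>&1 || systemctl restart docker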

{{- end}}
systemctlEnableAndStart kubelet || exit {{GetCSEErrorCode "ERR_KUBELET_START_FAIL"}}
wait_for_file 1200 1 /etc/systemd/system/kubelet-monitor.service || exit {{GetCSEErrorCode "ERR_FILE_WATCH_TIMEOUT"}}
systemctlEnableAndStart kubelet-monitor || exit {{GetCSEErrorCode "ERR_KUBELET_START_FAIL"}}
@jackfrancis (Member, Author):

Also (continuing the systemd relationship conversation): as a rule, we block the startup of each monitor job on the successful startup of the service it monitors, as exemplified by this kubelet-monitor start operation.
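
The same gating rule applied to another service looks roughly like this (a sketch only; the unit and helper names follow the kubelet example above, and the error-code names here are placeholders for the {{GetCSEErrorCode ...}} template calls):

systemctlEnableAndStart docker || exit $ERR_DOCKER_START_FAIL
wait_for_file 1200 1 /etc/systemd/system/docker-monitor.service || exit $ERR_FILE_WATCH_TIMEOUT
systemctlEnableAndStart docker-monitor || exit $ERR_DOCKER_START_FAIL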

DOCKER_VERSION=1.13.1-1
NVIDIA_CONTAINER_RUNTIME_VER=2.0.0
NVIDIA_DOCKER_SUFFIX=docker18.09.2-1
PRIVATE_IP=$(ip -4 addr show eth0 | grep -Po '(?<=inet )[\d.]+')
@jackfrancis (Member, Author):

This is a new, common "what is my private IP address?" implementation to be shared across the various places where that runtime determination is needed.

eth0 always has this information at first boot; in an Azure CNI configuration, an "azure0" bridge interface is set up with the primary NIC IP address (and the eth0 interface no longer contains it).

In the edge case where neither of those exists, we fall back to the IP address that the hostname resolves to. We avoid doing that as the primary path because we don't want to rely upon DNS lookups.

cc @Michael-Sinz

#!/bin/bash
NODE_INDEX=$(hostname | tail -c 2)
NODE_NAME=$(hostname)
PRIVATE_IP=$(hostname -I | cut -d' ' -f1)
@jackfrancis (Member, Author):

This var assignment has been moved into cse_helpers.sh for more general-purpose usage, and the etcd vars are moved down to local vars inside the funcs that use them.
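
Schematically, the local-variable move looks like this (hypothetical function and variable names; the real helpers live in cse_helpers.sh, and 2379/2380 are just the standard etcd client/peer ports used for illustration):

# Before: computed as script-level globals every time the service script ran.
# After: scoped to the one function that needs them.
configure_etcd() {
  local private_ip
  private_ip=$(ip -4 addr show eth0 | grep -Po '(?<=inet )[\d.]+')
  local etcd_peer_url="https://${private_ip}:2380"    # standard etcd peer port
  local etcd_client_url="https://${private_ip}:2379"  # standard etcd client port
  echo "etcd peer: ${etcd_peer_url}, client: ${etcd_client_url}"
}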

sysctl_reload 10 5 120 || exit {{GetCSEErrorCode "ERR_SYSCTL_RELOAD"}}
wait_for_file 1200 1 /etc/default/kubelet || exit {{GetCSEErrorCode "ERR_FILE_WATCH_TIMEOUT"}}
wait_for_file 1200 1 /var/lib/kubelet/kubeconfig || exit {{GetCSEErrorCode "ERR_FILE_WATCH_TIMEOUT"}}
if [[ -n ${MASTER_NODE} ]]; then
@jackfrancis (Member, Author):

This one-time VMSS master-specific foo has been moved into the bootstrap script, and out of the shell script that runs every time that the kubelet systemd service starts.

DOCKER_VERSION=1.13.1-1
NVIDIA_CONTAINER_RUNTIME_VER=2.0.0
NVIDIA_DOCKER_SUFFIX=docker18.09.2-1
PRIVATE_IP=$( (ip -br -4 addr show eth0 || ip -br -4 addr show azure0) | grep -Po '\d+\.\d+\.\d+\.\d+')
@jackfrancis (Member, Author):

This code does the following:

  1. Get me the IP addresses on eth0
  2. Or, if there aren't any, get me the IP addresses on azure0
  3. If there isn't exactly 1 IP address from trying those, then just get the IP address that the hostname DNS entry returns

#3 is undesirable as it relies upon functional DNS, so we only do it in a fallback scenario

We consider the possibility that eth0 won't have the desired IP address in order to make this solution resilient during the lifecycle of the VM: in Azure CNI configuration scenarios the IP address will be attached to the "azure0" interface when the first (non-hostNetwork) pod is scheduled onto the node that the VM represents (that's how Azure CNI routes container traffic out of the VM).
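
A minimal sketch of that three-step logic as a standalone helper (the function name is an assumption; the PR wires this logic into cse_helpers.sh):

#!/bin/bash
get_private_ip() {
  local ips
  # 1) try eth0; 2) if that fails, try azure0 (Azure CNI moves the primary IP there)
  ips=$( (ip -br -4 addr show eth0 || ip -br -4 addr show azure0) 2>/dev/null \
    | grep -Po '\d+\.\d+\.\d+\.\d+')
  # 3) unless exactly one address came back, fall back to DNS resolution
  if [[ -z "${ips}" ]] || [[ "$(grep -c '^' <<< "${ips}")" -ne 1 ]]; then
    hostname -i | cut -d' ' -f1
  else
    echo "${ips}"
  fi
}

PRIVATE_IP=$(get_private_ip)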

Collaborator:

Which one takes priority, azure0 or eth0? If azure0 is there then eth0 should not be, so would it be better to check azure0 first and then eth0?

(Unclear from your comment)

@jackfrancis (Member, Author):

When we define the VM in the ARM template, we declare a primary IP address on a single NIC, so we expect it to be reflected in eth0. Because Azure CNI creates a bridge "azure0" interface and "takes over" the IP address from eth0, that's where we have to look (if we want to look locally and not rely upon DNS) for the primary IP address of the host.

An additional consideration is that Ubuntu 18.04-LTS's cloud-init implementation has a bug: it doesn't respect the ARM template-declared primary IP address when the spec also includes secondary IP addresses, and so eth0 will have more than one IP address (those secondary IP addresses are for Azure CNI to use in the container networking layer, not the host layer, so eth0 should only ever have 1 IP address). To deal with that edge case (e.g., 18.04-LTS on first boot, before the cloud-init network config override has been applied) we evaluate the number of IP addresses returned (the | grep -c '^' part), and if it's not exactly 1, we just fall back to relying upon DNS.

Restart=always
RestartSec=10
RemainAfterExit=yes
Environment=CONTAINER_RUNTIME={{GetContainerRuntime}}
@jackfrancis (Member, Author):

This service is now overloaded to support both moby and containerd.
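
A sketch of what a runtime-aware health check can look like (the CONTAINER_RUNTIME values and the containerd probe command are assumptions, not lifted from the PR):

#!/bin/bash
# The unit's Environment=CONTAINER_RUNTIME=... selects the liveness probe.
case "${CONTAINER_RUNTIME}" in
  containerd)
    # assumed probe: ask the containerd daemon for its version over its socket
    ctr --address /run/containerd/containerd.sock version >/dev/null 2>&1 \
      || systemctl restart containerd
    ;;
  docker|*)
    docker ps >/dev/null 2>&1 || systemctl restart docker
    ;;
esac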

echo "" >> /etc/environment
fi
{{- if IsMasterVirtualMachineScaleSets}}
source {{GetCSEHelpersScriptFilepath}}
@jackfrancis (Member, Author):

This source statement gets us the $PRIVATE_IP var we need

env azure.Environment
azureClient *armhelpers.AzureClient
firstMasterRegexStr = fmt.Sprintf("^%s-", common.LegacyControlPlaneVMPrefix)
firstMasterRegexStr = fmt.Sprintf("^%s-.*-0", common.LegacyControlPlaneVMPrefix)
@jackfrancis (Member, Author):

If you look hard enough you find interesting things
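
The change narrows the match from any control-plane VM to only the first one (instance index 0). Assuming a prefix like k8s-master, the effect is roughly:

# "k8s-master" stands in for common.LegacyControlPlaneVMPrefix; VM names are hypothetical.
for vm in k8s-master-30819786-0 k8s-master-30819786-1; do
  [[ "${vm}" =~ ^k8s-master- ]]     && echo "old pattern matches ${vm}"
  [[ "${vm}" =~ ^k8s-master-.*-0 ]] && echo "new pattern matches ${vm}"
done
# The old pattern matches both VMs; the new pattern matches only the ...-0 instance.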

@mboersma (Member) left a comment:

/lgtm

@acs-bot acs-bot added the lgtm label Sep 14, 2020
@jackfrancis jackfrancis merged commit c670571 into Azure:master Sep 14, 2020
@jackfrancis jackfrancis deleted the systemd-monitor branch September 14, 2020 22:08

acs-bot commented Sep 14, 2020

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jackfrancis, mboersma

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details: Needs approval from an approver in each of these files:
  • OWNERS [jackfrancis,mboersma]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

penggu pushed a commit to penggu/aks-engine that referenced this pull request Oct 28, 2020

4 participants