Skip to content

Do not make the liveness probe fail in case of network issue#7579

Merged
L3n41c merged 1 commit intomasterfrom
lenaic/no_liveness_depending_on_network
Mar 4, 2021
Merged

Do not make the liveness probe fail in case of network issue#7579
L3n41c merged 1 commit intomasterfrom
lenaic/no_liveness_depending_on_network

Conversation

@L3n41c
Copy link
Copy Markdown
Member

@L3n41c L3n41c commented Mar 3, 2021

What does this PR do?

Do not make the liveness probe fail in case of network issue.

Motivation

On Kubernetes, a liveness probe failure makes kubernetes restart the container.
As a consequence, the liveness probe mustn’t fail for reasons that are external to the agent.

In particular, if there is a network issue so that the agent cannot talk to the kubelet anymore, restarting the agent will not help fixing the network. So, the liveness probe mustn’t fail for that reason.

Additional Notes

We noticed that, in case of network issue, the liveness probe of the agent was flapping.
Here is the plan to fix that:

  1. short term: disable the liveness probe
  2. mid term (this PR): identify the components whose health probe might fail in case of network issue and remove them from the liveness probe, but keep them for the readiness probe.
  3. longer term (in another PR): identify in the above-mentioned components where are the calls that are blocking in case of network issue and ensure that they have a time-out shorter than the health check period.

Describe your test plan

Start an agent pod and blackhole all the network traffic to/from that pod except the connections from the kubelet to the liveness and readiness probes:

# docker ps | grep 'agent run'
83ff441aeb9b        l3n41c/agent                                        "agent run"              10 minutes ago      Up 10 minutes                           k8s_agent_datadog-agent-docker-p55ls_datadog-agent-helm_083ca219-491b-430d-ac62-67d779902ee1_0

# AGENT_PID=$(docker inspect 83ff441aeb9b --format {{.State.Pid}})

# ip addr show cbr0
4: cbr0: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1460 qdisc htb state UP group default qlen 1000
    link/ether 7e:67:55:3e:ce:6b brd ff:ff:ff:ff:ff:ff
    inet 10.44.6.1/24 brd 10.44.6.255 scope global cbr0
       valid_lft forever preferred_lft forever
# HOST_IP_ON_THE_POD_NET=10.44.6.1

# nsenter -n -t $AGENT_PID iptables -A INPUT -i lo -j ACCEPT
# nsenter -n -t $AGENT_PID iptables -A INPUT -s $HOST_IP_ON_THE_POD_NET -p tcp -m multiport --dport 5555,8126 -j ACCEPT
# nsenter -n -t $AGENT_PID iptables -P INPUT DROP
# nsenter -n -t $AGENT_PID iptables -A OUTPUT -o lo -j ACCEPT
# nsenter -n -t $AGENT_PID iptables -A OUTPUT -d $HOST_IP_ON_THE_POD_NET -p tcp -m multiport --sport 5555,8126 -j ACCEPT
# nsenter -n -t $AGENT_PID iptables -P OUTPUT DROP

Once those commands have been run on the host, after a while (depending on the liveness probe delay and #of errors), the agent pod should:

  • restart with master and the events should show that the liveness probe failed.
  • remain up and running with this change and the events should show that only the readiness probe failed.

@L3n41c L3n41c added this to the 7.27.0 milestone Mar 3, 2021
@L3n41c L3n41c requested review from a team as code owners March 3, 2021 21:15
@L3n41c L3n41c force-pushed the lenaic/no_liveness_depending_on_network branch from 6284408 to 78e96f0 Compare March 4, 2021 11:30
Copy link
Copy Markdown
Contributor

@ahmed-mez ahmed-mez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the changes in pkg/autodiscovery/config_poller.go will be applied to all the config providers (kubelet, docker, ecs) while for the listeners you only updated kubelet, should we apply the same change in docker and ecs listeners ?

@L3n41c L3n41c force-pushed the lenaic/no_liveness_depending_on_network branch from 78e96f0 to e3b2d29 Compare March 4, 2021 14:14
@L3n41c L3n41c requested a review from ahmed-mez March 4, 2021 14:20
@L3n41c L3n41c merged commit 999ee1d into master Mar 4, 2021
@L3n41c L3n41c deleted the lenaic/no_liveness_depending_on_network branch March 4, 2021 21:00
L3n41c added a commit that referenced this pull request Mar 18, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants