
Do not make the liveness probe fail in case of network issue — Part 2 (#7605)

Merged

L3n41c merged 26 commits into main from lenaic/no_liveness_depending_on_network on Jun 23, 2021

Conversation

@L3n41c (Member) commented Mar 8, 2021

What does this PR do?

Do not make the liveness probe fail in case of network issue.
This is a follow-up of #7579.

  • Add a debug log that dumps all the goroutine stack traces when a component is unhealthy, to help diagnose where a goroutine is stuck rather than only reporting the failed health probe.
  • Add a context.Context parameter to all the KubeUtil methods.
  • Enhance the health framework so that each tick carries the deadline a component must meet to avoid being considered unhealthy (a sketch of the consuming side follows the stack traces below).
  • Ensure the deadline given by the health framework is propagated to the KubeUtil for the ad-kubeletlistener component, to avoid having the goroutine stuck there:
```
2021-03-08 10:56:57 UTC | CORE | INFO | (pkg/api/healthprobe/healthprobe.go:72 in healthHandler) | Healthcheck failed on: [ad-kubeletlistener]
2021-03-08 10:56:58 UTC | CORE | DEBUG | (pkg/api/healthprobe/healthprobe.go:73 in healthHandler) | goroutines stack:
[…]
goroutine 309 [select]:
net/http.(*http2ClientConn).roundTrip(0xc000c4d080, 0xc000f62700, 0x0, 0x0, 0x0, 0x0)
        /usr/local/go/src/net/http/h2_bundle.go:7668 +0x9c5
net/http.(*http2Transport).RoundTripOpt(0xc000237ce0, 0xc000f62700, 0x3231b00, 0xc000c35110, 0xc0007e6140, 0x5)
        /usr/local/go/src/net/http/h2_bundle.go:6981 +0x1a5
net/http.(*http2Transport).RoundTrip(...)
        /usr/local/go/src/net/http/h2_bundle.go:6942
net/http.http2noDialH2RoundTripper.RoundTrip(0xc000237ce0, 0xc000f62700, 0x39e3c40, 0xc000237ce0, 0x0)
        /usr/local/go/src/net/http/h2_bundle.go:9197 +0x3e
net/http.(*Transport).roundTrip(0xc000b75400, 0xc000f62700, 0x35ac9e0, 0x101, 0xc000f62700)
        /usr/local/go/src/net/http/transport.go:537 +0xdec
net/http.(*Transport).RoundTrip(0xc000b75400, 0xc000f62700, 0xc000b75400, 0xc0099ea1f1fbc006, 0x5821272ea6)
        /usr/local/go/src/net/http/roundtrip.go:17 +0x35
net/http.send(0xc000bf7200, 0x39e1620, 0xc000b75400, 0xc0099ea1f1fbc006, 0x5821272ea6, 0x4d87a00, 0xc00062b0f8, 0xc0099ea1f1fbc006, 0x1, 0x0)
        /usr/local/go/src/net/http/client.go:251 +0x454
net/http.(*Client).send(0xc0008c3ef0, 0xc000bf7200, 0xc0099ea1f1fbc006, 0x5821272ea6, 0x4d87a00, 0xc00062b0f8, 0x0, 0x1, 0xc000bf7200)
        /usr/local/go/src/net/http/client.go:175 +0xff
net/http.(*Client).do(0xc0008c3ef0, 0xc000bf7200, 0x0, 0x0, 0x0)
        /usr/local/go/src/net/http/client.go:717 +0x45f
net/http.(*Client).Do(...)
        /usr/local/go/src/net/http/client.go:585
github.com/DataDog/datadog-agent/pkg/util/kubernetes/kubelet.(*kubeletClient).query(0xc0008c3ef0, 0x3a3cea0, 0xc00005a068, 0x360b18e, 0x5, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /home/datadog/devel/DataDog/datadog-agent/pkg/util/kubernetes/kubelet/kubelet_client.go:124 +0x274
github.com/DataDog/datadog-agent/pkg/util/kubernetes/kubelet.(*KubeUtil).QueryKubelet(...)
        /home/datadog/devel/DataDog/datadog-agent/pkg/util/kubernetes/kubelet/kubelet.go:360
github.com/DataDog/datadog-agent/pkg/util/kubernetes/kubelet.(*KubeUtil).GetLocalPodList(0xc000622d00, 0x3a3cea0, 0xc00005a068, 0x3, 0x1, 0x39dda60, 0xc000628860, 0x13a2c60)
        /home/datadog/devel/DataDog/datadog-agent/pkg/util/kubernetes/kubelet/kubelet.go:185 +0x145
github.com/DataDog/datadog-agent/pkg/util/kubernetes/kubelet.(*PodWatcher).PullChanges(0xc0005cd4c0, 0xc000efdf6c, 0x0, 0x0, 0x3, 0xc0013c9e01)
        /home/datadog/devel/DataDog/datadog-agent/pkg/util/kubernetes/kubelet/podwatcher.go:67 +0x50
github.com/DataDog/datadog-agent/pkg/autodiscovery/listeners.(*KubeletListener).Listen.func1(0xc000c8f320)
        /home/datadog/devel/DataDog/datadog-agent/pkg/autodiscovery/listeners/kubelet.go:120 +0x165
created by github.com/DataDog/datadog-agent/pkg/autodiscovery/listeners.(*KubeletListener).Listen
        /home/datadog/devel/DataDog/datadog-agent/pkg/autodiscovery/listeners/kubelet.go:105 +0x5c
```
  • Ensure the deadline given by the health framework is propagated to the KubeUtil for the ad-config-provider-* components to avoid having the goroutine stuck there:
```
2021-03-08 13:39:45 UTC | CORE | INFO | (pkg/api/healthprobe/healthprobe.go:72 in healthHandler) | Healthcheck failed on: [ad-config-provider-kubernetes]
2021-03-08 13:39:46 UTC | CORE | DEBUG | (pkg/api/healthprobe/healthprobe.go:73 in healthHandler) | goroutines stack:
[…]
goroutine 311 [select]:
net/http.(*http2ClientConn).roundTrip(0xc000a67200, 0xc0008d9500, 0x0, 0x0, 0x0, 0x0)
        /usr/local/go/src/net/http/h2_bundle.go:7668 +0x9c5
net/http.(*http2Transport).RoundTripOpt(0xc000198e70, 0xc0008d9500, 0x3231a00, 0xc000a8a8d0, 0xc000de8140, 0x5)
        /usr/local/go/src/net/http/h2_bundle.go:6981 +0x1a5
net/http.(*http2Transport).RoundTrip(...)
        /usr/local/go/src/net/http/h2_bundle.go:6942
net/http.http2noDialH2RoundTripper.RoundTrip(0xc000198e70, 0xc0008d9500, 0x39e3b80, 0xc000198e70, 0x0)
        /usr/local/go/src/net/http/h2_bundle.go:9197 +0x3e
net/http.(*Transport).roundTrip(0xc0008c6c80, 0xc0008d9500, 0xf8, 0xc000933701, 0xc0008d9500)
        /usr/local/go/src/net/http/transport.go:537 +0xdec
net/http.(*Transport).RoundTrip(0xc0008c6c80, 0xc0008d9500, 0xc0008c6c80, 0xc009a8286b014262, 0x3188961bbb)
        /usr/local/go/src/net/http/roundtrip.go:17 +0x35
net/http.send(0xc0008d9200, 0x39e1560, 0xc0008c6c80, 0xc009a8286b014262, 0x3188961bbb, 0x4d87a00, 0xc000b522b0, 0xc009a8286b014262, 0x1, 0x0)
        /usr/local/go/src/net/http/client.go:251 +0x454
net/http.(*Client).send(0xc0006cdef0, 0xc0008d9200, 0xc009a8286b014262, 0x3188961bbb, 0x4d87a00, 0xc000b522b0, 0x0, 0x1, 0x7f4c9c4eca40)
        /usr/local/go/src/net/http/client.go:175 +0xff
net/http.(*Client).do(0xc0006cdef0, 0xc0008d9200, 0x0, 0x0, 0x0)
        /usr/local/go/src/net/http/client.go:717 +0x45f
net/http.(*Client).Do(...)
        /usr/local/go/src/net/http/client.go:585
github.com/DataDog/datadog-agent/pkg/util/kubernetes/kubelet.(*kubeletClient).query(0xc0006cdef0, 0x3a3ce60, 0xc00005c068, 0x360b08e, 0x5, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /home/datadog/devel/DataDog/datadog-agent/pkg/util/kubernetes/kubelet/kubelet_client.go:124 +0x274
github.com/DataDog/datadog-agent/pkg/util/kubernetes/kubelet.(*KubeUtil).QueryKubelet(...)
        /home/datadog/devel/DataDog/datadog-agent/pkg/util/kubernetes/kubelet/kubelet.go:360
github.com/DataDog/datadog-agent/pkg/util/kubernetes/kubelet.(*KubeUtil).GetLocalPodList(0xc0001a15f0, 0x3a3ce60, 0xc00005c068, 0x1, 0x50000000322ebc0, 0xffffffffffffffff, 0x0, 0xc00055f288)
        /home/datadog/devel/DataDog/datadog-agent/pkg/util/kubernetes/kubelet/kubelet.go:185 +0x145
github.com/DataDog/datadog-agent/pkg/autodiscovery/providers.(*KubeletConfigProvider).Collect(0xc000d777c0, 0xc00055f298, 0xc00055f200, 0xc000d0db08, 0x2665ee6, 0xc000a39ec0)
        /home/datadog/devel/DataDog/datadog-agent/pkg/autodiscovery/providers/kubelet.go:57 +0x59
github.com/DataDog/datadog-agent/pkg/autodiscovery.(*configPoller).collect(0xc000a27e50, 0x0, 0x0, 0x0, 0x1, 0x1, 0x0)
        /home/datadog/devel/DataDog/datadog-agent/pkg/autodiscovery/config_poller.go:121 +0x62
github.com/DataDog/datadog-agent/pkg/autodiscovery.(*configPoller).poll(0xc000a27e50, 0xc0006f0d80)
        /home/datadog/devel/DataDog/datadog-agent/pkg/autodiscovery/config_poller.go:93 +0x2ca
created by github.com/DataDog/datadog-agent/pkg/autodiscovery.(*configPoller).start
        /home/datadog/devel/DataDog/datadog-agent/pkg/autodiscovery/config_poller.go:66 +0x14b
```
  • Ensure the deadline given by the health framework is propagated to the KubeUtil for the metadata-* components to avoid having the goroutine stuck there:
```
2021-03-17 07:30:27 UTC | CORE | INFO | (pkg/api/healthprobe/healthprobe.go:72 in healthHandler) | Healthcheck failed on: [metadata-host]
2021-03-17 07:30:28 UTC | CORE | DEBUG | (pkg/api/healthprobe/healthprobe.go:73 in healthHandler) | goroutines stack:
goroutine 550 [select]:
net/http.(*Transport).getConn(0xc000c39900, 0xc000fba600, 0x0, 0xc000f2eaa0, 0x5, 0xc0014fb248, 0x12, 0x0, 0x0, 0x0, ...)
        /usr/local/go/src/net/http/transport.go:1368 +0x589
net/http.(*Transport).roundTrip(0xc000c39900, 0xc0009d8c00, 0xf8, 0xc000704f01, 0xc0009d8c00)
        /usr/local/go/src/net/http/transport.go:579 +0x7eb
net/http.(*Transport).RoundTrip(0xc000c39900, 0xc0009d8c00, 0xc000c39900, 0xc00c89e766d35916, 0x1c1eb37d84a)
        /usr/local/go/src/net/http/roundtrip.go:17 +0x35
net/http.send(0xc0009d8b00, 0x39e2820, 0xc000c39900, 0xc00c89e766d35916, 0x1c1eb37d84a, 0x4d88a20, 0xc0015382d0, 0xc00c89e766d35916, 0x1, 0x0)
        /usr/local/go/src/net/http/client.go:251 +0x454
net/http.(*Client).send(0xc000ca40b0, 0xc0009d8b00, 0xc00c89e766d35916, 0x1c1eb37d84a, 0x4d88a20, 0xc0015382d0, 0x0, 0x1, 0x7f0469ebef50)
        /usr/local/go/src/net/http/client.go:175 +0xff
net/http.(*Client).do(0xc000ca40b0, 0xc0009d8b00, 0x0, 0x0, 0x0)
        /usr/local/go/src/net/http/client.go:717 +0x45f
net/http.(*Client).Do(...)
        /usr/local/go/src/net/http/client.go:585
github.com/DataDog/datadog-agent/pkg/util/kubernetes/kubelet.(*kubeletClient).query(0xc000ca40b0, 0x3a3e1a0, 0xc000058068, 0x360c0ee, 0x5, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /home/datadog/devel/DataDog/datadog-agent/pkg/util/kubernetes/kubelet/kubelet_client.go:124 +0x274
github.com/DataDog/datadog-agent/pkg/util/kubernetes/kubelet.(*KubeUtil).QueryKubelet(...)
        /home/datadog/devel/DataDog/datadog-agent/pkg/util/kubernetes/kubelet/kubelet.go:360
github.com/DataDog/datadog-agent/pkg/util/kubernetes/kubelet.(*KubeUtil).GetLocalPodList(0xc000984820, 0x3a3e1a0, 0xc000058068, 0xc0009ef948, 0xc00015f9b0, 0xc0003c7ee8, 0xffffffffffffffff, 0xffffffffffffffff)
        /home/datadog/devel/DataDog/datadog-agent/pkg/util/kubernetes/kubelet/kubelet.go:185 +0x145
github.com/DataDog/datadog-agent/pkg/util/kubernetes/kubelet.(*KubeUtil).GetNodename(0xc000984820, 0x3a3e1a0, 0xc000058068, 0x0, 0x0, 0x0, 0x0)
        /home/datadog/devel/DataDog/datadog-agent/pkg/util/kubernetes/kubelet/kubelet.go:154 +0x45
github.com/DataDog/datadog-agent/pkg/util/kubernetes/hostinfo.GetNodeLabels(0xc000f26cf0, 0xc0009efc60, 0xc0009ef9e8)
        /home/datadog/devel/DataDog/datadog-agent/pkg/util/kubernetes/hostinfo/node_labels.go:25 +0x68
github.com/DataDog/datadog-agent/pkg/util/kubernetes/hostinfo.GetTags(0x322e180, 0xc0009efba0, 0xc0009efc00, 0x3, 0x3)
        /home/datadog/devel/DataDog/datadog-agent/pkg/util/kubernetes/hostinfo/tags.go:26 +0x49
github.com/DataDog/datadog-agent/pkg/metadata/host.GetHostTags(0xc0008dc500, 0x3)
        /home/datadog/devel/DataDog/datadog-agent/pkg/metadata/host/host_tags.go:126 +0x6c8
github.com/DataDog/datadog-agent/pkg/metadata/host.GetPayload(0xc00005fc00, 0x40, 0x360a277, 0x3, 0x0)
        /home/datadog/devel/DataDog/datadog-agent/pkg/metadata/host/host.go:60 +0xe5
github.com/DataDog/datadog-agent/pkg/metadata/v5.GetPayload(0xc00005fc00, 0x40, 0x360a277, 0x3, 0x0)
        /home/datadog/devel/DataDog/datadog-agent/pkg/metadata/v5/v5.go:22 +0x6f
github.com/DataDog/datadog-agent/pkg/metadata.(*HostCollector).Send(0x52ed690, 0xc0007df470, 0x0, 0x0)
        /home/datadog/devel/DataDog/datadog-agent/pkg/metadata/host.go:25 +0x2b
github.com/DataDog/datadog-agent/pkg/metadata.(*Scheduler).AddCollector.func1(0xc0006c03f0, 0xc00067bfe0, 0x1a3185c5000, 0x39df120, 0x52ed690, 0x360b490, 0x4)
        /home/datadog/devel/DataDog/datadog-agent/pkg/metadata/scheduler.go:100 +0x1a8
created by github.com/DataDog/datadog-agent/pkg/metadata.(*Scheduler).AddCollector
        /home/datadog/devel/DataDog/datadog-agent/pkg/metadata/scheduler.go:87 +0x1f9
```
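
To make the third bullet above concrete, here is a minimal sketch of what the consuming side of the enhanced health framework could look like. It assumes a channel that now delivers a deadline with each tick; the `watch` and `getPods` names are illustrative, not the exact datadog-agent API:

```go
package main

import (
	"context"
	"log"
	"time"
)

// watch consumes health ticks: each tick carries the deadline the component
// must meet to stay healthy, and that deadline bounds the kubelet call so
// the goroutine can never block past it.
func watch(healthTicks <-chan time.Time, getPods func(context.Context) error) {
	for deadline := range healthTicks {
		ctx, cancel := context.WithDeadline(context.Background(), deadline)
		if err := getPods(ctx); err != nil {
			// On a network issue this logs "context deadline exceeded"
			// instead of leaving the goroutine stuck inside net/http.
			log.Printf("kubelet call aborted: %v", err)
		}
		cancel()
	}
}

func main() {
	ticks := make(chan time.Time, 1)
	ticks <- time.Now().Add(500 * time.Millisecond)
	close(ticks)
	// Stand-in for KubeUtil.GetLocalPodList: blocks until the deadline hits,
	// simulating a blackholed kubelet.
	watch(ticks, func(ctx context.Context) error {
		<-ctx.Done()
		return ctx.Err()
	})
}
```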

Motivation

On Kubernetes, a liveness probe failure makes Kubernetes restart the container.
As a consequence, the liveness probe mustn't fail for reasons that are external to the agent.

In particular, if a network issue prevents the agent from talking to the kubelet, restarting the agent will not fix the network, so the liveness probe mustn't fail for that reason.

Additional Notes

We noticed that, in case of a network issue, the liveness probe of the agent was flapping.
Here is the plan to fix that:

  1. Short term: disable the liveness probe.
  2. Mid term (Do not make the liveness probe fail in case of network issue #7579): identify the components whose health probe might fail in case of a network issue and remove them from the liveness probe, while keeping them in the readiness probe.
  3. Long term (this PR): identify, in the above-mentioned components, which calls block in case of a network issue, and ensure they have a timeout shorter than the health check period.

As this PR is quite big, here is a suggested review order that should make it easier to understand what it does:

Describe your test plan

Start an agent pod and blackhole all the network traffic to/from that pod except the connections from the kubelet to the liveness and readiness probes:

```
# docker ps | grep 'agent run'
83ff441aeb9b        l3n41c/agent                                        "agent run"              10 minutes ago      Up 10 minutes                           k8s_agent_datadog-agent-docker-p55ls_datadog-agent-helm_083ca219-491b-430d-ac62-67d779902ee1_0

# AGENT_PID=$(docker inspect 83ff441aeb9b --format '{{.State.Pid}}')

# ip addr show cbr0
4: cbr0: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1460 qdisc htb state UP group default qlen 1000
    link/ether 7e:67:55:3e:ce:6b brd ff:ff:ff:ff:ff:ff
    inet 10.44.6.1/24 brd 10.44.6.255 scope global cbr0
       valid_lft forever preferred_lft forever
# HOST_IP_ON_THE_POD_NET=10.44.6.1

# nsenter -n -t $AGENT_PID iptables -A INPUT -i lo -j ACCEPT
# nsenter -n -t $AGENT_PID iptables -A INPUT -s $HOST_IP_ON_THE_POD_NET -p tcp -m multiport --dports 5555,8126 -j ACCEPT
# nsenter -n -t $AGENT_PID iptables -P INPUT DROP
# nsenter -n -t $AGENT_PID iptables -A OUTPUT -o lo -j ACCEPT
# nsenter -n -t $AGENT_PID iptables -A OUTPUT -d $HOST_IP_ON_THE_POD_NET -p tcp -m multiport --sports 5555,8126 -j ACCEPT
# nsenter -n -t $AGENT_PID iptables -P OUTPUT DROP
```

Once those commands have been run on the host, the agent pod should remain up and running, and the logs should show many network calls being aborted with `context deadline exceeded` errors, while the liveness probe remains healthy.

@L3n41c L3n41c force-pushed the lenaic/no_liveness_depending_on_network branch from e23889d to f800592 Compare March 8, 2021 10:45
…d-kubeletlistener` component

The goal is to avoid having the `ad-kubeletlistener` component become unhealthy because it is stuck waiting for a kubelet reply (see the first stack trace in the PR description above).
@L3n41c L3n41c force-pushed the lenaic/no_liveness_depending_on_network branch 3 times, most recently from a29145c to 186e8a7 Compare March 9, 2021 10:00
…-*` components

The goal is to avoid having the `ad-config-provider-kubernetes` component become unhealthy because it is stuck waiting for a kubelet reply (see the second stack trace in the PR description above).
@L3n41c L3n41c force-pushed the lenaic/no_liveness_depending_on_network branch from 186e8a7 to 58d016f Compare March 9, 2021 10:55
@L3n41c L3n41c force-pushed the lenaic/no_liveness_depending_on_network branch 2 times, most recently from dc77c73 to 44e3b35 Compare March 19, 2021 08:37
…etadata-*` components

The goal is to avoid having the `metadata-host` component become unhealthy because it is stuck waiting for a kubelet reply (see the third stack trace in the PR description above).
@L3n41c L3n41c force-pushed the lenaic/no_liveness_depending_on_network branch from 44e3b35 to 56d7729 Compare March 19, 2021 09:38
@L3n41c L3n41c force-pushed the lenaic/no_liveness_depending_on_network branch 3 times, most recently from 02ccf50 to adaac09 Compare March 26, 2021 17:55
@L3n41c L3n41c force-pushed the lenaic/no_liveness_depending_on_network branch from adaac09 to ec9f718 Compare March 29, 2021 07:52
…istening`, `ad-dockerlistener` and `ad-dockerprovider` components
@L3n41c L3n41c force-pushed the lenaic/no_liveness_depending_on_network branch from 28b589f to 27ba364 Compare March 30, 2021 13:14
Comment on lines +27 to +39
```go
GetEntity() string                                   // unique entity name
GetTaggerEntity() string                             // tagger entity name
GetADIdentifiers(context.Context) ([]string, error)  // identifiers on which templates will be matched
GetHosts(context.Context) (map[string]string, error) // network --> IP address
GetPorts(context.Context) ([]ContainerPort, error)   // network ports
GetTags() ([]string, string, error)                  // tags and tags hash
GetPid(context.Context) (int, error)                 // process identifier
GetHostname(context.Context) (string, error)         // hostname.domainname for the entity
GetCreationTime() integration.CreationTime           // created before or after the agent start
IsReady(context.Context) bool                        // is the service ready
GetCheckNames(context.Context) []string              // slice of check names defined in kubernetes annotations or docker labels
HasFilter(containers.FilterType) bool                // whether the service is excluded by metrics or logs exclusion config
GetExtraConfig([]byte) ([]byte, error)               // extra configuration values
```
Contributor

As most of the implementations of Service do not use context.Context, I am wondering if it would be possible to remove context.Context and replace it with a Cancel method. The Cancel method could cancel the last request, if any. You could also propagate the information through a channel when a request times out, if you need it.

Note: you can create two interfaces to avoid forcing a service to implement an empty Cancel method when its requests are not cancellable:

```go
type Service interface {
	GetEntity() string
}

type CancellableService interface {
	Service
	Cancel()
}
```

When you have an instance of Service, you can dynamically check whether it also implements CancellableService and call Cancel only when needed.
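
A minimal sketch of that type assertion, building on the two interfaces above (the `stop` function name is illustrative):

```go
func stop(s Service) {
	// Only services whose requests are cancellable implement
	// CancellableService; plain services are left untouched.
	if c, ok := s.(CancellableService); ok {
		c.Cancel()
	}
}
```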

@L3n41c L3n41c modified the milestones: 7.28.0, 7.29.0 Apr 16, 2021
@L3n41c L3n41c modified the milestones: 7.29.0, 7.30.0 May 28, 2021
@L3n41c L3n41c requested a review from a team as a code owner June 15, 2021 15:38
@L3n41c L3n41c force-pushed the lenaic/no_liveness_depending_on_network branch from dcd6798 to af49b5b Compare June 16, 2021 07:47
…o_liveness_depending_on_network

Conflicts:
      pkg/autodiscovery/listeners/docker.go
      pkg/autodiscovery/providers/docker.go
      pkg/autodiscovery/providers/kubelet.go
      pkg/autodiscovery/providers/providers.go
      pkg/clusteragent/orchestrator/status.go
      pkg/collector/corechecks/cluster/orchestrator/orchestrator.go
      pkg/metadata/host/host.go
      pkg/process/checks/container.go
      pkg/process/config/config.go
      pkg/tagger/collectors/docker_main.go
      pkg/tagger/collectors/ecs_extract.go
      pkg/tagger/collectors/ecs_extract_test.go
      pkg/tagger/collectors/ecs_main.go
      pkg/tagger/local/tagger.go
      pkg/util/azure/azure.go
      pkg/util/azure/azure_test.go
      pkg/util/docker/docker_util.go
      pkg/util/ec2/ec2_tags.go
      pkg/util/ecs/common.go
      pkg/util/gce/gce.go
      pkg/util/gce/gce_test.go
@L3n41c L3n41c force-pushed the lenaic/no_liveness_depending_on_network branch from af49b5b to 46531e0 Compare June 16, 2021 09:10
L3n41c and others added 4 commits June 16, 2021 11:37
Co-authored-by: Cedric Lamoriniere <cedric.lamoriniere@datadoghq.com>
…DataDog/datadog-agent into lenaic/no_liveness_depending_on_network
@L3n41c L3n41c requested review from clamoriniere and ogaca-dd June 17, 2021 14:04
@mfpierre (Contributor) left a comment

Looking good for @DataDog/container-app

```diff
 defaultOffsetThreshold := 60

-defaultHosts := util.GetCloudProviderNTPHosts()
+defaultHosts := util.GetCloudProviderNTPHosts(context.TODO())
```
Member

What's the plan for removing this TODO?

Member Author

This TODO() is in a core check.
Unfortunately, today, there is no timeout on check execution.
Some checks are known to last several minutes, like the KSM or SNMP ones for example, and we cannot know how long custom checks written by our users will take.
Introducing a timeout for check execution would be a breaking change that is not in the scope of this work.

As a consequence, we could imagine removing this TODO() and adding a context.Context parameter to this function, in order to move the context.Context creation higher up the function call tree.
But in any case, the context.Context created in the scope of a (core) check will be an "empty" context (either context.Background() or context.TODO()), so it wouldn't change the behaviour.

The code could look better if all the context.Context creations were moved as high as possible in the function call tree, but I thought I would do that in another PR, because this one already implements the context.Context propagation in the scopes where a deadline can be defined (i.e. event loops monitored by the health probe, but not checks, nor interactive agent CLI commands).
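
As a rough illustration of that follow-up, assuming util.GetCloudProviderNTPHosts returns the host list as a []string (only that function comes from the diff above; the package and function names around it are hypothetical):

```go
package ntpsketch // hypothetical home for the follow-up

import (
	"context"

	"github.com/DataDog/datadog-agent/pkg/util"
)

// Hypothetical follow-up: the caller (the check runner) owns the context
// and forwards it, so the check body no longer creates a context.TODO().
func defaultNTPHosts(ctx context.Context) []string {
	return util.GetCloudProviderNTPHosts(ctx)
}
```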

Member

Thanks for the explanation!

> The code could look better if all the context.Context creations were moved as high as possible in the function call tree, but I thought I would do that in another PR, because this one already implements the context.Context propagation in the scopes where a deadline can be defined

that makes sense, let's do it in another PR

Also, it would be good, IMO, to track the TODOs somewhere (at least the ones that are not going to be simply replaced by context.Background()).

@L3n41c L3n41c merged commit be59410 into main Jun 23, 2021
@L3n41c L3n41c deleted the lenaic/no_liveness_depending_on_network branch June 23, 2021 11:36