Do not make the liveness probe fail in case of network issue — Part #2 (#7605)
Conversation
Force-pushed from e23889d to f800592
…d-kubeletlistener` component
The goal is to prevent the `ad-kubeletlistener` component from becoming unhealthy because it is stuck waiting for a kubelet reply at this point:
```
2021-03-08 10:56:57 UTC | CORE | INFO | (pkg/api/healthprobe/healthprobe.go:72 in healthHandler) | Healthcheck failed on: [ad-kubeletlistener]
2021-03-08 10:56:58 UTC | CORE | DEBUG | (pkg/api/healthprobe/healthprobe.go:73 in healthHandler) | goroutines stack:
[…]
goroutine 309 [select]:
net/http.(*http2ClientConn).roundTrip(0xc000c4d080, 0xc000f62700, 0x0, 0x0, 0x0, 0x0)
/usr/local/go/src/net/http/h2_bundle.go:7668 +0x9c5
net/http.(*http2Transport).RoundTripOpt(0xc000237ce0, 0xc000f62700, 0x3231b00, 0xc000c35110, 0xc0007e6140, 0x5)
/usr/local/go/src/net/http/h2_bundle.go:6981 +0x1a5
net/http.(*http2Transport).RoundTrip(...)
/usr/local/go/src/net/http/h2_bundle.go:6942
net/http.http2noDialH2RoundTripper.RoundTrip(0xc000237ce0, 0xc000f62700, 0x39e3c40, 0xc000237ce0, 0x0)
/usr/local/go/src/net/http/h2_bundle.go:9197 +0x3e
net/http.(*Transport).roundTrip(0xc000b75400, 0xc000f62700, 0x35ac9e0, 0x101, 0xc000f62700)
/usr/local/go/src/net/http/transport.go:537 +0xdec
net/http.(*Transport).RoundTrip(0xc000b75400, 0xc000f62700, 0xc000b75400, 0xc0099ea1f1fbc006, 0x5821272ea6)
/usr/local/go/src/net/http/roundtrip.go:17 +0x35
net/http.send(0xc000bf7200, 0x39e1620, 0xc000b75400, 0xc0099ea1f1fbc006, 0x5821272ea6, 0x4d87a00, 0xc00062b0f8, 0xc0099ea1f1fbc006, 0x1, 0x0)
/usr/local/go/src/net/http/client.go:251 +0x454
net/http.(*Client).send(0xc0008c3ef0, 0xc000bf7200, 0xc0099ea1f1fbc006, 0x5821272ea6, 0x4d87a00, 0xc00062b0f8, 0x0, 0x1, 0xc000bf7200)
/usr/local/go/src/net/http/client.go:175 +0xff
net/http.(*Client).do(0xc0008c3ef0, 0xc000bf7200, 0x0, 0x0, 0x0)
/usr/local/go/src/net/http/client.go:717 +0x45f
net/http.(*Client).Do(...)
/usr/local/go/src/net/http/client.go:585
github.com/DataDog/datadog-agent/pkg/util/kubernetes/kubelet.(*kubeletClient).query(0xc0008c3ef0, 0x3a3cea0, 0xc00005a068, 0x360b18e, 0x5, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/home/datadog/devel/DataDog/datadog-agent/pkg/util/kubernetes/kubelet/kubelet_client.go:124 +0x274
github.com/DataDog/datadog-agent/pkg/util/kubernetes/kubelet.(*KubeUtil).QueryKubelet(...)
/home/datadog/devel/DataDog/datadog-agent/pkg/util/kubernetes/kubelet/kubelet.go:360
github.com/DataDog/datadog-agent/pkg/util/kubernetes/kubelet.(*KubeUtil).GetLocalPodList(0xc000622d00, 0x3a3cea0, 0xc00005a068, 0x3, 0x1, 0x39dda60, 0xc000628860, 0x13a2c60)
/home/datadog/devel/DataDog/datadog-agent/pkg/util/kubernetes/kubelet/kubelet.go:185 +0x145
github.com/DataDog/datadog-agent/pkg/util/kubernetes/kubelet.(*PodWatcher).PullChanges(0xc0005cd4c0, 0xc000efdf6c, 0x0, 0x0, 0x3, 0xc0013c9e01)
/home/datadog/devel/DataDog/datadog-agent/pkg/util/kubernetes/kubelet/podwatcher.go:67 +0x50
github.com/DataDog/datadog-agent/pkg/autodiscovery/listeners.(*KubeletListener).Listen.func1(0xc000c8f320)
/home/datadog/devel/DataDog/datadog-agent/pkg/autodiscovery/listeners/kubelet.go:120 +0x165
created by github.com/DataDog/datadog-agent/pkg/autodiscovery/listeners.(*KubeletListener).Listen
/home/datadog/devel/DataDog/datadog-agent/pkg/autodiscovery/listeners/kubelet.go:105 +0x5c
```
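For reference, here is a minimal sketch of the pattern applied here — not the PR's actual code; `healthTicks`, `podLister`, and the pod handling are assumed stand-ins for the health framework channel, the `KubeUtil` instance, and the listener logic. Each health tick carries the deadline the component must not miss, and the kubelet query inherits it through a context, so a network outage surfaces as a fast error instead of a goroutine stuck inside `net/http` as in the trace above:

```go
package main

import (
	"context"
	"log"
	"time"
)

// podLister is a hypothetical stand-in for the subset of KubeUtil used here.
type podLister interface {
	GetLocalPodList(ctx context.Context) ([]string, error)
}

// listen bounds every kubelet query with the deadline received from the
// health framework, so the call aborts instead of hanging the goroutine.
func listen(healthTicks <-chan time.Time, kubeUtil podLister) {
	for deadline := range healthTicks {
		ctx, cancel := context.WithDeadline(context.Background(), deadline)
		pods, err := kubeUtil.GetLocalPodList(ctx)
		cancel()
		if err != nil {
			// On a blackholed network this returns quickly with
			// "context deadline exceeded" instead of blocking forever.
			log.Printf("failed to pull pod list: %v", err)
			continue
		}
		_ = pods // …process the pod list
	}
}
```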
Force-pushed from a29145c to 186e8a7
…-*` components
The goal is to prevent the `ad-config-provider-kubernetes` component from becoming unhealthy because it is stuck waiting for a kubelet reply at this point:
```
2021-03-08 13:39:45 UTC | CORE | INFO | (pkg/api/healthprobe/healthprobe.go:72 in healthHandler) | Healthcheck failed on: [ad-config-provider-kubernetes]
2021-03-08 13:39:46 UTC | CORE | DEBUG | (pkg/api/healthprobe/healthprobe.go:73 in healthHandler) | goroutines stack:
[…]
goroutine 311 [select]:
net/http.(*http2ClientConn).roundTrip(0xc000a67200, 0xc0008d9500, 0x0, 0x0, 0x0, 0x0)
/usr/local/go/src/net/http/h2_bundle.go:7668 +0x9c5
net/http.(*http2Transport).RoundTripOpt(0xc000198e70, 0xc0008d9500, 0x3231a00, 0xc000a8a8d0, 0xc000de8140, 0x5)
/usr/local/go/src/net/http/h2_bundle.go:6981 +0x1a5
net/http.(*http2Transport).RoundTrip(...)
/usr/local/go/src/net/http/h2_bundle.go:6942
net/http.http2noDialH2RoundTripper.RoundTrip(0xc000198e70, 0xc0008d9500, 0x39e3b80, 0xc000198e70, 0x0)
/usr/local/go/src/net/http/h2_bundle.go:9197 +0x3e
net/http.(*Transport).roundTrip(0xc0008c6c80, 0xc0008d9500, 0xf8, 0xc000933701, 0xc0008d9500)
/usr/local/go/src/net/http/transport.go:537 +0xdec
net/http.(*Transport).RoundTrip(0xc0008c6c80, 0xc0008d9500, 0xc0008c6c80, 0xc009a8286b014262, 0x3188961bbb)
/usr/local/go/src/net/http/roundtrip.go:17 +0x35
net/http.send(0xc0008d9200, 0x39e1560, 0xc0008c6c80, 0xc009a8286b014262, 0x3188961bbb, 0x4d87a00, 0xc000b522b0, 0xc009a8286b014262, 0x1, 0x0)
/usr/local/go/src/net/http/client.go:251 +0x454
net/http.(*Client).send(0xc0006cdef0, 0xc0008d9200, 0xc009a8286b014262, 0x3188961bbb, 0x4d87a00, 0xc000b522b0, 0x0, 0x1, 0x7f4c9c4eca40)
/usr/local/go/src/net/http/client.go:175 +0xff
net/http.(*Client).do(0xc0006cdef0, 0xc0008d9200, 0x0, 0x0, 0x0)
/usr/local/go/src/net/http/client.go:717 +0x45f
net/http.(*Client).Do(...)
/usr/local/go/src/net/http/client.go:585
github.com/DataDog/datadog-agent/pkg/util/kubernetes/kubelet.(*kubeletClient).query(0xc0006cdef0, 0x3a3ce60, 0xc00005c068, 0x360b08e, 0x5, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/home/datadog/devel/DataDog/datadog-agent/pkg/util/kubernetes/kubelet/kubelet_client.go:124 +0x274
github.com/DataDog/datadog-agent/pkg/util/kubernetes/kubelet.(*KubeUtil).QueryKubelet(...)
/home/datadog/devel/DataDog/datadog-agent/pkg/util/kubernetes/kubelet/kubelet.go:360
github.com/DataDog/datadog-agent/pkg/util/kubernetes/kubelet.(*KubeUtil).GetLocalPodList(0xc0001a15f0, 0x3a3ce60, 0xc00005c068, 0x1, 0x50000000322ebc0, 0xffffffffffffffff, 0x0, 0xc00055f288)
/home/datadog/devel/DataDog/datadog-agent/pkg/util/kubernetes/kubelet/kubelet.go:185 +0x145
github.com/DataDog/datadog-agent/pkg/autodiscovery/providers.(*KubeletConfigProvider).Collect(0xc000d777c0, 0xc00055f298, 0xc00055f200, 0xc000d0db08, 0x2665ee6, 0xc000a39ec0)
/home/datadog/devel/DataDog/datadog-agent/pkg/autodiscovery/providers/kubelet.go:57 +0x59
github.com/DataDog/datadog-agent/pkg/autodiscovery.(*configPoller).collect(0xc000a27e50, 0x0, 0x0, 0x0, 0x1, 0x1, 0x0)
/home/datadog/devel/DataDog/datadog-agent/pkg/autodiscovery/config_poller.go:121 +0x62
github.com/DataDog/datadog-agent/pkg/autodiscovery.(*configPoller).poll(0xc000a27e50, 0xc0006f0d80)
/home/datadog/devel/DataDog/datadog-agent/pkg/autodiscovery/config_poller.go:93 +0x2ca
created by github.com/DataDog/datadog-agent/pkg/autodiscovery.(*configPoller).start
/home/datadog/devel/DataDog/datadog-agent/pkg/autodiscovery/config_poller.go:66 +0x14b
```
Force-pushed from 186e8a7 to 58d016f
…/no_liveness_depending_on_network
Force-pushed from dc77c73 to 44e3b35
…etadata-*` components
The goal is to prevent the `metadata-host` component from becoming unhealthy because it is stuck waiting for a kubelet reply at this point:
```
2021-03-17 07:30:27 UTC | CORE | INFO | (pkg/api/healthprobe/healthprobe.go:72 in healthHandler) | Healthcheck failed on: [metadata-host]
2021-03-17 07:30:28 UTC | CORE | DEBUG | (pkg/api/healthprobe/healthprobe.go:73 in healthHandler) | goroutines stack:
goroutine 550 [select]:
net/http.(*Transport).getConn(0xc000c39900, 0xc000fba600, 0x0, 0xc000f2eaa0, 0x5, 0xc0014fb248, 0x12, 0x0, 0x0, 0x0, ...)
/usr/local/go/src/net/http/transport.go:1368 +0x589
net/http.(*Transport).roundTrip(0xc000c39900, 0xc0009d8c00, 0xf8, 0xc000704f01, 0xc0009d8c00)
/usr/local/go/src/net/http/transport.go:579 +0x7eb
net/http.(*Transport).RoundTrip(0xc000c39900, 0xc0009d8c00, 0xc000c39900, 0xc00c89e766d35916, 0x1c1eb37d84a)
/usr/local/go/src/net/http/roundtrip.go:17 +0x35
net/http.send(0xc0009d8b00, 0x39e2820, 0xc000c39900, 0xc00c89e766d35916, 0x1c1eb37d84a, 0x4d88a20, 0xc0015382d0, 0xc00c89e766d35916, 0x1, 0x0)
/usr/local/go/src/net/http/client.go:251 +0x454
net/http.(*Client).send(0xc000ca40b0, 0xc0009d8b00, 0xc00c89e766d35916, 0x1c1eb37d84a, 0x4d88a20, 0xc0015382d0, 0x0, 0x1, 0x7f0469ebef50)
/usr/local/go/src/net/http/client.go:175 +0xff
net/http.(*Client).do(0xc000ca40b0, 0xc0009d8b00, 0x0, 0x0, 0x0)
/usr/local/go/src/net/http/client.go:717 +0x45f
net/http.(*Client).Do(...)
/usr/local/go/src/net/http/client.go:585
github.com/DataDog/datadog-agent/pkg/util/kubernetes/kubelet.(*kubeletClient).query(0xc000ca40b0, 0x3a3e1a0, 0xc000058068, 0x360c0ee, 0x5, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/home/datadog/devel/DataDog/datadog-agent/pkg/util/kubernetes/kubelet/kubelet_client.go:124 +0x274
github.com/DataDog/datadog-agent/pkg/util/kubernetes/kubelet.(*KubeUtil).QueryKubelet(...)
/home/datadog/devel/DataDog/datadog-agent/pkg/util/kubernetes/kubelet/kubelet.go:360
github.com/DataDog/datadog-agent/pkg/util/kubernetes/kubelet.(*KubeUtil).GetLocalPodList(0xc000984820, 0x3a3e1a0, 0xc000058068, 0xc0009ef948, 0xc00015f9b0, 0xc0003c7ee8, 0xffffffffffffffff, 0xffffffffffffffff)
/home/datadog/devel/DataDog/datadog-agent/pkg/util/kubernetes/kubelet/kubelet.go:185 +0x145
github.com/DataDog/datadog-agent/pkg/util/kubernetes/kubelet.(*KubeUtil).GetNodename(0xc000984820, 0x3a3e1a0, 0xc000058068, 0x0, 0x0, 0x0, 0x0)
/home/datadog/devel/DataDog/datadog-agent/pkg/util/kubernetes/kubelet/kubelet.go:154 +0x45
github.com/DataDog/datadog-agent/pkg/util/kubernetes/hostinfo.GetNodeLabels(0xc000f26cf0, 0xc0009efc60, 0xc0009ef9e8)
/home/datadog/devel/DataDog/datadog-agent/pkg/util/kubernetes/hostinfo/node_labels.go:25 +0x68
github.com/DataDog/datadog-agent/pkg/util/kubernetes/hostinfo.GetTags(0x322e180, 0xc0009efba0, 0xc0009efc00, 0x3, 0x3)
/home/datadog/devel/DataDog/datadog-agent/pkg/util/kubernetes/hostinfo/tags.go:26 +0x49
github.com/DataDog/datadog-agent/pkg/metadata/host.GetHostTags(0xc0008dc500, 0x3)
/home/datadog/devel/DataDog/datadog-agent/pkg/metadata/host/host_tags.go:126 +0x6c8
github.com/DataDog/datadog-agent/pkg/metadata/host.GetPayload(0xc00005fc00, 0x40, 0x360a277, 0x3, 0x0)
/home/datadog/devel/DataDog/datadog-agent/pkg/metadata/host/host.go:60 +0xe5
github.com/DataDog/datadog-agent/pkg/metadata/v5.GetPayload(0xc00005fc00, 0x40, 0x360a277, 0x3, 0x0)
/home/datadog/devel/DataDog/datadog-agent/pkg/metadata/v5/v5.go:22 +0x6f
github.com/DataDog/datadog-agent/pkg/metadata.(*HostCollector).Send(0x52ed690, 0xc0007df470, 0x0, 0x0)
/home/datadog/devel/DataDog/datadog-agent/pkg/metadata/host.go:25 +0x2b
github.com/DataDog/datadog-agent/pkg/metadata.(*Scheduler).AddCollector.func1(0xc0006c03f0, 0xc00067bfe0, 0x1a3185c5000, 0x39df120, 0x52ed690, 0x360b490, 0x4)
/home/datadog/devel/DataDog/datadog-agent/pkg/metadata/scheduler.go:100 +0x1a8
created by github.com/DataDog/datadog-agent/pkg/metadata.(*Scheduler).AddCollector
/home/datadog/devel/DataDog/datadog-agent/pkg/metadata/scheduler.go:87 +0x1f9
```
Force-pushed from 44e3b35 to 56d7729
…/no_liveness_depending_on_network
Force-pushed from 02ccf50 to adaac09
Force-pushed from adaac09 to ec9f718
…istening`, `ad-dockerlistener` and `ad-dockerprovider` components
Force-pushed from 28b589f to 27ba364
```go
type Service interface {
	GetEntity() string                                   // unique entity name
	GetTaggerEntity() string                             // tagger entity name
	GetADIdentifiers(context.Context) ([]string, error)  // identifiers on which templates will be matched
	GetHosts(context.Context) (map[string]string, error) // network --> IP address
	GetPorts(context.Context) ([]ContainerPort, error)   // network ports
	GetTags() ([]string, string, error)                  // tags and tags hash
	GetPid(context.Context) (int, error)                 // process identifier
	GetHostname(context.Context) (string, error)         // hostname.domainname for the entity
	GetCreationTime() integration.CreationTime           // created before or after the agent start
	IsReady(context.Context) bool                        // is the service ready
	GetCheckNames(context.Context) []string              // slice of check names defined in kubernetes annotations or docker labels
	HasFilter(containers.FilterType) bool                // whether the service is excluded by metrics or logs exclusion config
	GetExtraConfig([]byte) ([]byte, error)               // extra configuration values
}
```
As most of the implementations of `Service` do not use `context.Context`, I am wondering if it would be possible to remove `context.Context` and replace it with a `Cancel` method. The `Cancel` method could cancel the last request, if any. You could also propagate the information with a channel when a request times out, if you need that information as well.
Note: you can create two interfaces to avoid forcing a service to implement an empty `Cancel` method when its requests are not cancellable:

```go
type Service interface {
	GetEntity() string
}

type CancellableService interface {
	Service
	Cancel()
}
```

When you have an instance of `Service`, you can dynamically check whether the instance also implements `CancellableService` and call `Cancel` only when needed.
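For concreteness, a minimal sketch of that dynamic check (the `svc` variable is illustrative):

```go
// Call Cancel only on services that actually support cancellation.
if cancellable, ok := svc.(CancellableService); ok {
	cancellable.Cancel()
}
```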
Force-pushed from dcd6798 to af49b5b
…o_liveness_depending_on_network
Conflicts:
pkg/autodiscovery/listeners/docker.go
pkg/autodiscovery/providers/docker.go
pkg/autodiscovery/providers/kubelet.go
pkg/autodiscovery/providers/providers.go
pkg/clusteragent/orchestrator/status.go
pkg/collector/corechecks/cluster/orchestrator/orchestrator.go
pkg/metadata/host/host.go
pkg/process/checks/container.go
pkg/process/config/config.go
pkg/tagger/collectors/docker_main.go
pkg/tagger/collectors/ecs_extract.go
pkg/tagger/collectors/ecs_extract_test.go
pkg/tagger/collectors/ecs_main.go
pkg/tagger/local/tagger.go
pkg/util/azure/azure.go
pkg/util/azure/azure_test.go
pkg/util/docker/docker_util.go
pkg/util/ec2/ec2_tags.go
pkg/util/ecs/common.go
pkg/util/gce/gce.go
pkg/util/gce/gce_test.go
Force-pushed from af49b5b to 46531e0
…o_liveness_depending_on_network
Co-authored-by: Cedric Lamoriniere <cedric.lamoriniere@datadoghq.com>
…DataDog/datadog-agent into lenaic/no_liveness_depending_on_network
mfpierre left a comment:
Looking good for @DataDog/container-app
```diff
 defaultOffsetThreshold := 60

-defaultHosts := util.GetCloudProviderNTPHosts()
+defaultHosts := util.GetCloudProviderNTPHosts(context.TODO())
```
What's the plan for removing this TODO?
This `TODO()` is in a core check.
Unfortunately, today, there is no timeout for check execution.
Some checks are known to last several minutes, like the KSM or SNMP ones for example, and we cannot know how long the custom checks written by our users will run.
Introducing a timeout for check execution would be a breaking change that is not in the scope of this work.
As a consequence, we could imagine removing this `TODO()` and adding a `context.Context` parameter to this function in order to move the `context.Context` creation higher in the function call tree.
But in any case, a `context.Context` created in the scope of a (core) check will be an "empty" context (either `context.Background()` or `context.TODO()`), so it wouldn't change the behaviour.
The code would look better if all the `context.Context` creations were moved as high as possible in the function call tree, but I thought I would do it in another PR, because this one already implements the `context.Context` propagation in the scopes where a deadline can be defined (i.e. event loops monitored by health probes, but not checks, nor interactive agent CLI commands).
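To illustrate the trade-off, here is a hedged sketch — the helper names are hypothetical, and only `util.GetCloudProviderNTPHosts` comes from the diff above:

```go
// ntpHostsToday mirrors the current code: the check body creates an
// "empty" context itself (the TODO from the diff above).
func ntpHostsToday() []string {
	return util.GetCloudProviderNTPHosts(context.TODO())
}

// ntpHostsWithContext sketches the hypothetical refactor: the context is
// created higher in the call tree and passed down. Until the check runner
// can set a deadline, ctx would still be an "empty" context.Background(),
// so the behaviour would not change — hence deferring this to another PR.
func ntpHostsWithContext(ctx context.Context) []string {
	return util.GetCloudProviderNTPHosts(ctx)
}
```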
Thanks for the explanation!

> The code would look better if all the `context.Context` creations were moved as high as possible in the function call tree, but I thought I would do it in another PR, because this one already implements the `context.Context` propagation in the scopes where a deadline can be defined

That makes sense, let's do it in another PR.
Also, it would be good IMO to track the TODOs somewhere (at least the ones that are not going to be just replaced by `Background`).
…o_liveness_depending_on_network
…o_liveness_depending_on_network
What does this PR do?
Do not make the liveness probe fail in case of network issue.
This is a follow-up of #7579.
- Add a `context.Context` parameter to all the `KubeUtil` methods.
- Change the `health` framework to push, in its ticks, the deadline a component mustn't miss in order not to be considered unhealthy.
- Put a deadline on the calls to `KubeUtil` for the `ad-kubeletlistener` component to avoid having the goroutine stuck there (first stack trace above).
- Put a deadline on the calls to `KubeUtil` for the `ad-config-provider-*` components to avoid having the goroutine stuck there (second stack trace above).
- Put a deadline on the calls to `KubeUtil` for the `metadata-*` components to avoid having the goroutine stuck there (third stack trace above).

Motivation
On Kubernetes, a liveness probe failure makes kubernetes restart the container.
As a consequence, the liveness probe mustn’t fail for reasons that are external to the agent.
In particular, if there is a network issue such that the agent can no longer talk to the kubelet, restarting the agent will not help fix the network. So the liveness probe mustn't fail for that reason.
Additional Notes
We noticed that, in case of a network issue, the liveness probe of the agent was flapping.
Here is the plan to fix that:
As this PR is quite big, here is a review order suggestion that should give an easier understanding of what this PR does:
1. the `ad-servicelistening` component;
2. the `ad-config-provider-*` components;
3. the `ad-dockerlistener` component;
4. the `ad-ecslistener` component;
5. the `ad-kubeletlistener` component;
6. the use of `context.Context` when doing a remote call;
7. the propagation of `context.Context` in each function between the components monitored by a health probe and the components doing remote calls.

Describe your test plan
Start an agent pod and blackhole all the network traffic to/from that pod except the connections from the kubelet to the liveness and readiness probes:
Once those commands have been run on the host, the agent pod should remain up and running, and the logs should show that a lot of network calls are aborted with a `context deadline exceeded` error. But the liveness probe should remain healthy.
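For reference, a minimal illustration of the kind of error the logs show once the traffic is blackholed — the kubelet address here is purely illustrative, not from the test plan — where the request fails fast with the context error instead of hanging:

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Bound the request the same way the agent now bounds its kubelet queries.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// Illustrative kubelet pods endpoint; with the traffic blackholed the
	// request cannot complete, and the context aborts it at the deadline.
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, "https://10.0.0.1:10250/pods", nil)
	if err != nil {
		panic(err)
	}
	if _, err := http.DefaultClient.Do(req); err != nil {
		fmt.Println(err) // … context deadline exceeded
	}
}
```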