Description
Host operating system: output of uname -a
Linux lga-kubnode470 4.4.205-1.el7.elrepo.x86_64 #1 SMP Fri Nov 29 10:10:01 EST 2019 x86_64 x86_64 x86_64 GNU/Linux
node_exporter version: output of node_exporter --version
node_exporter, version 0.18.0 (branch: HEAD, revision: f97f01c)
build user: root@77cb1854c0b0
build date: 20190509-23:12:18
go version: go1.12.5
node_exporter command line flags
- --path.procfs=/host/proc
- --path.sysfs=/host/sys
- --collector.textfile.directory=/var/log/pulsepoint/prometheus
Are you running node_exporter in Docker?
node_exporter is running in Docker in a Kubernetes cluster:
image: "prom/node-exporter:v0.18.0"
The node_exporter DaemonSet manifest was created from the stable Helm chart.
There are 3 volumes:
- hostPath /proc -> /host/proc in the container (read-only)
- hostPath /sys -> /host/sys in the container (read-only)
- hostPath /var/log/pulsepoint/prometheus -> /var/log/pulsepoint/prometheus in the container (used for textfile collector metrics)
No Kubernetes securityContext is set.
hostNetwork: true
hostPID: true
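For reference, a roughly equivalent standalone docker run command (a sketch assembled from the image, mounts, and flags listed above, not the actual rendered chart output) would be:

docker run --net=host --pid=host \
  -v /proc:/host/proc:ro \
  -v /sys:/host/sys:ro \
  -v /var/log/pulsepoint/prometheus:/var/log/pulsepoint/prometheus \
  prom/node-exporter:v0.18.0 \
  --path.procfs=/host/proc \
  --path.sysfs=/host/sys \
  --collector.textfile.directory=/var/log/pulsepoint/prometheus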
What did you do that produced an error?
We are running a Kubernetes cluster with kube-router-based networking, so we make heavy use of IPVS.
At some point we started to observe a significant number of errors from node_exporter on all nodes about duplicate IPVS metrics (starting and peaking at 7:00 AM, then fading within an hour):
time="2019-12-19T12:25:10Z" level=error msg="
error gathering metrics: 324 error(s) occurred:
* [from Gatherer #2] collected metric "node_ipvs_backend_connections_active" { label:<name:"local_address" value:"10.203.128.184" > label:<name:"local_port" value:"9001" > label:<name:"proto" value:"TCP" > label:<name:"remote_address" value:"10.204.57.184" > label:<name:"remote_port" value:"9001" > gauge:<value:0 > } was collected before with the same name and label values
...
" source="log.go:172"
We see many errors of this type for node_ipvs_backend_weight, node_ipvs_backend_connections_active, and node_ipvs_backend_connections_inactive.
The scrape interval is 15 seconds and the cluster has 500 nodes.
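As far as we can tell the ipvs collector parses /proc/net/ip_vs, so a quick (and admittedly rough) way to check whether the source data itself repeats entries on an affected node is something like the following; the awk pattern is our assumption about the file layout, not taken from the collector code:

# print "virtual-server backend" pairs and show any that occur more than once
awk '/^(TCP|UDP|FWM)/ {vip=$1" "$2} /->/ && vip {print vip, $2}' /proc/net/ip_vs | sort | uniq -d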
What did you expect to see?
No errors
What did you see instead?
A large number of errors, appearing from a certain point in time onward, all with the same pattern: they start at 7:00 AM and fade within an hour.