node_cpu_core_throttles_total: per core, not per cpu #659

@knweiss

@rtreffer @SuperQ

Host operating system: output of uname -a

# uname -a
Linux haswell 3.10.0-514.26.2.el7.x86_64 #1 SMP Tue Jul 4 15:04:05 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

node_exporter version: output of node_exporter --version

$ ./node_exporter --version
node_exporter, version 0.14.0 (branch: package_throttles_total, revision: 60ee361e86cb1457151753f0aa8c0da976c6bc26)

node_exporter command line flags

./node_exporter --collectors.enabled=cpu --log.level="debug"

Are you running node_exporter in Docker?

No, on physical multi-core systems.

What did you do that produced an error?

I am testing the node_cpu_core_throttles_total metric. As the metric name indicates, this is a per (physical) core metric and not a per (logical) cpu metric.

However, node_exporter currently creates two identical time series for each physical core if Hyper-Threading is enabled.

# HELP node_cpu_core_throttles_total Number of times this cpu core has been throttled.
# TYPE node_cpu_core_throttles_total counter
node_cpu_core_throttles_total{cpu="cpu1"} 61
node_cpu_core_throttles_total{cpu="cpu2"} 3
node_cpu_core_throttles_total{cpu="cpu4"} 108
node_cpu_core_throttles_total{cpu="cpu9"} 49
node_cpu_core_throttles_total{cpu="cpu25"} 61
node_cpu_core_throttles_total{cpu="cpu26"} 3
node_cpu_core_throttles_total{cpu="cpu28"} 108
node_cpu_core_throttles_total{cpu="cpu33"} 49

(I've omitted the metrics with value 0.)

What did you expect to see?

I expected one node_cpu_core_throttles_total time series for each physical core (24 in my case) and not one for each logical cpu (48).

This creates lots of redundant time series on multi-core systems.
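
A minimal sketch of what per-core reporting could look like, not the actual node_exporter code. It assumes each logical CPU has already been read into a record holding its physical package ID and core ID (on Linux these are typically available under each cpuN directory as topology/physical_package_id and topology/core_id) together with its core_throttle_count. The type and function names, and the `package`/`core` label names, are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// cpuSample is a hypothetical record for one logical CPU.
type cpuSample struct {
	CPU       int    // logical CPU number (the N in cpuN)
	Package   int    // physical package (socket) ID
	Core      int    // core ID, only unique within a package
	Throttles uint64 // core_throttle_count as seen via this logical CPU
}

// coreKey identifies one physical core across all packages.
type coreKey struct {
	Package, Core int
}

// dedupeCoreThrottles keeps a single counter per physical core.
// HT siblings expose the same core counter, so the first sample
// seen for a given core is kept and later siblings are skipped.
func dedupeCoreThrottles(samples []cpuSample) map[coreKey]uint64 {
	perCore := make(map[coreKey]uint64)
	for _, s := range samples {
		k := coreKey{s.Package, s.Core}
		if _, seen := perCore[k]; !seen {
			perCore[k] = s.Throttles
		}
	}
	return perCore
}

// formatCoreThrottles renders the per-core counters as exposition
// lines, sorted by package and core. The label names are an
// assumption made for this sketch.
func formatCoreThrottles(perCore map[coreKey]uint64) []string {
	keys := make([]coreKey, 0, len(perCore))
	for k := range perCore {
		keys = append(keys, k)
	}
	sort.Slice(keys, func(i, j int) bool {
		if keys[i].Package != keys[j].Package {
			return keys[i].Package < keys[j].Package
		}
		return keys[i].Core < keys[j].Core
	})
	lines := make([]string, 0, len(keys))
	for _, k := range keys {
		lines = append(lines, fmt.Sprintf(
			`node_cpu_core_throttles_total{package="%d",core="%d"} %d`,
			k.Package, k.Core, perCore[k]))
	}
	return lines
}
```

Keying on the (package, core) pair rather than on the core ID alone matters on multi-socket machines like this one, because core IDs repeat across packages.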

What did you see instead?

The /sys file system of one of my test machines looks like this:

# for i in /sys/bus/cpu/devices/cpu{0..47}/thermal_throttle/core_throttle_count; do\
 echo "$i : $(cat $i)"; done | grep -vw 0
/sys/bus/cpu/devices/cpu1/thermal_throttle/core_throttle_count : 61
/sys/bus/cpu/devices/cpu2/thermal_throttle/core_throttle_count : 3
/sys/bus/cpu/devices/cpu4/thermal_throttle/core_throttle_count : 108
/sys/bus/cpu/devices/cpu9/thermal_throttle/core_throttle_count : 49
/sys/bus/cpu/devices/cpu25/thermal_throttle/core_throttle_count : 61
/sys/bus/cpu/devices/cpu26/thermal_throttle/core_throttle_count : 3
/sys/bus/cpu/devices/cpu28/thermal_throttle/core_throttle_count : 108
/sys/bus/cpu/devices/cpu33/thermal_throttle/core_throttle_count : 49
# lscpu | grep ^NUMA
NUMA node(s):          2
NUMA node0 CPU(s):     0-11,24-35
NUMA node1 CPU(s):     12-23,36-47

Notice how each core metric is present twice in the fs (once for each HT sibling) and the cpu collector replicates this in its node_cpu_core_throttles_total metrics.
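
One way to tell the HT siblings apart, sketched here under assumptions: Linux usually exposes a topology/thread_siblings_list file per logical CPU, written in the same list syntax that lscpu prints above (for example `0-11,24-35`). The helper below parses that syntax and treats the lowest-numbered sibling as the one that reports the core counter. The function names are hypothetical, and the `1,25` pairing in the comment is inferred from the identical counters above, not read from the machine.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCPUList expands a kernel-style CPU list such as "0-11,24-35"
// or "1,25" into the individual logical CPU numbers.
func parseCPUList(list string) ([]int, error) {
	var cpus []int
	for _, part := range strings.Split(strings.TrimSpace(list), ",") {
		lo, hi, isRange := strings.Cut(part, "-")
		first, err := strconv.Atoi(lo)
		if err != nil {
			return nil, fmt.Errorf("bad CPU list entry %q: %w", part, err)
		}
		last := first
		if isRange {
			if last, err = strconv.Atoi(hi); err != nil {
				return nil, fmt.Errorf("bad CPU list entry %q: %w", part, err)
			}
		}
		if last < first {
			return nil, fmt.Errorf("descending range %q", part)
		}
		for c := first; c <= last; c++ {
			cpus = append(cpus, c)
		}
	}
	return cpus, nil
}

// isRepresentative reports whether cpu is the lowest-numbered entry
// in its sibling list, e.g. cpu1 (but not cpu25) for the list "1,25".
func isRepresentative(cpu int, siblings string) (bool, error) {
	cpus, err := parseCPUList(siblings)
	if err != nil {
		return false, err
	}
	lowest := cpus[0]
	for _, c := range cpus[1:] {
		if c < lowest {
			lowest = c
		}
	}
	return cpu == lowest, nil
}
```

A collector could then emit the per-core metric only for logical CPUs where isRepresentative returns true, which would give 24 series instead of 48 on this machine.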

See my PR #657 for another, similar problem.
