Scrape CPU latency stats from /proc/schedstat #1389
|
@SuperQ @discordianfish I reckon this is ready for review. This is my first node_exporter PR so please let me know if there are any things to fix. |
SuperQ left a comment:
Looks pretty good so far.
I know it's a bit of a pain, but we're trying to move proc file parsing to prometheus/procfs. We're trying to avoid adding new code that uses procFilePath() and adds parsing directly in the node_exporter.
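For readers unfamiliar with the file, here is a minimal sketch of the kind of parsing being discussed. It assumes the layout described in the kernel's sched-stats documentation, where each `cpuN` line carries nine numeric fields and the last three are running time (ns), waiting time (ns), and the number of timeslices. The type and function names are hypothetical and are not the prometheus/procfs API; the sample numbers are made up.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// cpuSchedStat is a hypothetical container for the last three fields of a
// "cpuN" line in /proc/schedstat (illustrative names, not the procfs API).
type cpuSchedStat struct {
	CPU           string // CPU number, e.g. "0"
	RunningNanos  uint64 // time tasks spent running on this CPU, in ns
	WaitingNanos  uint64 // time tasks spent waiting to run, in ns
	RunTimeslices uint64 // number of timeslices run on this CPU
}

// parseSchedstat extracts per-CPU stats from schedstat text. Lines that are
// not "cpuN" lines with nine fields (version, timestamp, domainN) are skipped.
func parseSchedstat(content string) ([]cpuSchedStat, error) {
	var stats []cpuSchedStat
	scanner := bufio.NewScanner(strings.NewReader(content))
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) != 10 || !strings.HasPrefix(fields[0], "cpu") {
			continue
		}
		// Only the last three fields are of interest here.
		var vals [3]uint64
		for i, f := range fields[7:10] {
			v, err := strconv.ParseUint(f, 10, 64)
			if err != nil {
				return nil, fmt.Errorf("parsing %q on %s line: %w", f, fields[0], err)
			}
			vals[i] = v
		}
		stats = append(stats, cpuSchedStat{
			CPU:           strings.TrimPrefix(fields[0], "cpu"),
			RunningNanos:  vals[0],
			WaitingNanos:  vals[1],
			RunTimeslices: vals[2],
		})
	}
	return stats, scanner.Err()
}

func main() {
	// Made-up sample content in the shape of /proc/schedstat.
	sample := "version 15\ntimestamp 4295000000\n" +
		"cpu0 10 0 20 30 40 50 6000000000 2000000000 300\n" +
		"domain0 3 0 0 0\n"
	stats, err := parseSchedstat(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, s := range stats {
		fmt.Printf("cpu=%s running_ns=%d waiting_ns=%d timeslices=%d\n",
			s.CPU, s.RunningNanos, s.WaitingNanos, s.RunTimeslices)
	}
}
```

Keeping this logic in a library rather than in the exporter means the collector only has to map already-parsed values onto metrics.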
|
No problem, I'll move the parsing logic over there. |
|
Sorry, Go changes its dependency management practices every time I use it, and I'm not sure how best to get CI to build against prometheus/procfs#186 before those changes have been accepted and released. The changes are pretty straightforward, though, so it shouldn't be too hard to figure out on the fly. |
|
Yeah, it's hard to do external tests on a remote branch without doing crazy stuff with Go modules. |
|
@SuperQ procfs changes have been merged, and I've updated the PR. Would you please re-review? |
|
Not sure what's up with buildkite. |
|
The other end-to-end fixture needs to be updated. ( |
|
@SuperQ fixed! 🎉 |
|
Nice! |
|
Two final things. Please add the new collector to the README. Also, add it to the changelog. |
These are useful as a direct indication of CPU contention and task scheduler latency.

Handy references:

- https://github.com/torvalds/linux/blob/master/Documentation/scheduler/sched-stats.txt
- https://doc.opensuse.org/documentation/leap/tuning/html/book.sle.tuning/cha.tuning.taskscheduler.html

procfs is updated to pull in the enabling change: prometheus/procfs#186

Signed-off-by: Phil Frost <phil@postmates.com>
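To illustrate why these counters indicate contention, here is a rough sketch that derives the average time a task waited on the run queue per timeslice between two samples. It assumes the waiting-time counter is in nanoseconds, as the kernel documentation states; the `sample` type and `avgWaitSeconds` function are hypothetical names for this example and do not come from the PR.

```go
package main

import "fmt"

// sample is a hypothetical snapshot of two /proc/schedstat counters for one CPU.
type sample struct {
	waitingNanos uint64 // cumulative time tasks spent waiting to run, in ns
	timeslices   uint64 // cumulative number of timeslices run
}

// avgWaitSeconds returns the mean run-queue wait per timeslice, in seconds,
// between two samples. It reports false when no timeslices ran in the
// interval or when a counter went backwards (e.g. after a reboot).
func avgWaitSeconds(prev, cur sample) (float64, bool) {
	if cur.timeslices <= prev.timeslices || cur.waitingNanos < prev.waitingNanos {
		return 0, false
	}
	deltaWait := float64(cur.waitingNanos - prev.waitingNanos)
	deltaSlices := float64(cur.timeslices - prev.timeslices)
	return deltaWait / deltaSlices / 1e9, true
}

func main() {
	// Made-up counter values taken some interval apart.
	prev := sample{waitingNanos: 1_000_000_000, timeslices: 100}
	cur := sample{waitingNanos: 3_000_000_000, timeslices: 300}
	if avg, ok := avgWaitSeconds(prev, cur); ok {
		fmt.Printf("average wait per timeslice: %.3fs\n", avg)
	}
}
```

A rising value on a busy host suggests that tasks are queuing for CPU time rather than running as soon as they become runnable.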
|
@SuperQ done. |
|
@SuperQ bump, any more to do here? |
|
@SuperQ there's a bug in this, which I've fixed at prometheus/procfs#191. Please let's postpone any release until that's incorporated. |
* The netdev collector CLI argument `--collector.netdev.ignored-devices` was renamed to `--collector.netdev.device-blacklist` in order to conform with the systemd collector. #1279
* The label named `state` on `node_systemd_service_restart_total` metrics was changed to `name` to better describe the metric. #1393
* Refactoring of the mdadm collector changes several metrics
  - `node_md_disks_active` is removed
  - `node_md_disks` now has a `state` label for "fail", "spare", "active" disks.
  - `node_md_is_active` is replaced by `node_md_state` with a state set of "active", "inactive", "recovering", "resync".
* Additional label `mountaddr` added to NFS device metrics to distinguish mounts from the same URL, but different IP addresses. #1417
* Metrics node_cpu_scaling_frequency_min_hrts and node_cpu_scaling_frequency_max_hrts of the cpufreq collector were renamed to node_cpu_scaling_frequency_min_hertz and node_cpu_scaling_frequency_max_hertz. #1510
* Collectors that are enabled, but are unable to find data to collect, now return 0 for `node_scrape_collector_success`.

* [CHANGE] Add `--collector.netdev.device-whitelist`. #1279
* [CHANGE] Ignore iso9660 filesystem on Linux #1355
* [CHANGE] Refactor mdadm collector #1403
* [CHANGE] Add `mountaddr` label to NFS metrics. #1417
* [CHANGE] Don't count empty collectors as success. #1613
* [FEATURE] New flag to disable default collectors #1276
* [FEATURE] Add experimental TLS support #1277, #1687, #1695
* [FEATURE] Add collector for Power Supply Class #1280
* [FEATURE] Add new schedstat collector #1389
* [FEATURE] Add FreeBSD zfs support #1394
* [FEATURE] Add uname support for Darwin and OpenBSD #1433
* [FEATURE] Add new metric node_cpu_info #1489
* [FEATURE] Add new thermal_zone collector #1425
* [FEATURE] Add new cooling_device metrics to thermal zone collector #1445
* [FEATURE] Add swap usage on darwin #1508
* [FEATURE] Add Btrfs collector #1512
* [FEATURE] Add RAPL collector #1523
* [FEATURE] Add new softnet collector #1576
* [FEATURE] Add new udp_queues collector #1503
* [FEATURE] Add basic authentication #1673
* [ENHANCEMENT] Log pid when there is a problem reading the process stats #1341
* [ENHANCEMENT] Collect InfiniBand port state and physical state #1357
* [ENHANCEMENT] Include additional XFS runtime statistics. #1423
* [ENHANCEMENT] Report non-fatal collection errors in the exporter metric. #1439
* [ENHANCEMENT] Expose IPVS firewall mark as a label #1455
* [ENHANCEMENT] Add check for systemd version before attempting to query certain metrics. #1413
* [ENHANCEMENT] Add a flag to adjust mount timeout #1486
* [ENHANCEMENT] Add new counters for flush requests in Linux 5.5 #1548
* [ENHANCEMENT] Add metrics and tests for UDP receive and send buffer errors #1534
* [ENHANCEMENT] The sockstat collector now exposes IPv6 statistics in addition to the existing IPv4 support. #1552
* [ENHANCEMENT] Add infiniband info metric #1563
* [ENHANCEMENT] Add unix socket support for supervisord collector #1592
* [ENHANCEMENT] Implement loadavg on all BSDs without cgo #1584
* [ENHANCEMENT] Add model_name and stepping to node_cpu_info metric #1617
* [ENHANCEMENT] Add `--collector.perf.cpus` to allow setting the CPU list for perf stats. #1561
* [ENHANCEMENT] Add metrics for IO errors and retries on Darwin. #1636
* [ENHANCEMENT] Add perf tracepoint collection flag #1664
* [ENHANCEMENT] ZFS: read contents of objset file #1632
* [ENHANCEMENT] Linux CPU: Cache CPU metrics to make them monotonically increasing #1711
* [BUGFIX] Read /proc/net files with a single read syscall #1380
* [BUGFIX] Renamed label `state` to `name` on `node_systemd_service_restart_total`. #1393
* [BUGFIX] Fix netdev nil reference on Darwin #1414
* [BUGFIX] Strip path.rootfs from mountpoint labels #1421
* [BUGFIX] Fix seconds reported by schedstat #1426
* [BUGFIX] Fix empty string in path.rootfs #1464
* [BUGFIX] Fix typo in cpufreq metric names #1510
* [BUGFIX] Read /proc/stat in one syscall #1538
* [BUGFIX] Fix OpenBSD cache memory information #1542
* [BUGFIX] Refactor textfile collector to avoid looping defer #1549
* [BUGFIX] Fix network speed math #1580
* [BUGFIX] collector/systemd: use regexp to extract systemd version #1647
* [BUGFIX] Fix initialization in perf collector when using multiple CPUs #1665
* [BUGFIX] Fix accidentally empty lines in meminfo_linux #1671

Signed-off-by: Ben Kochie <superq@gmail.com>