[Feature Request] Allow excluding certain hwmon data sources #2681

@xen0n

Host operating system

Linux lily 6.3.0-next-20230505-14814-g94feb5819da7 #1 SMP PREEMPT Tue Aug 30 11:11:44 AM CST 2022 loongarch64 GNU/Linux

This behavior is not tied to architecture quirks though, even with this exotic arch.

node_exporter version:

2023/05/06 14:50:14 github.com/josharian/native: unrecognized arch loong64 (LittleEndian), please file an issue
node_exporter, version 1.5.0 (branch: non-git, revision: 1b48970ffcf5630534fb00bb0687d73c66d1c959)
  build user:       portage@lily
  build date:       20230104-05:23:50
  go version:       go1.19.4
  platform:         linux/loong64

node_exporter command line flags

/usr/sbin/node_exporter --collector.textfile.directory=/var/lib/node_exporter/

node_exporter log output

(not really applicable, current behavior is expected)

Are you running node_exporter in Docker?

No

What did you do that produced an error?

Just running node_exporter as a regular systemd service without any additional config.

What did you expect to see?

The hwmon collector would not read from an amdgpu source that is runtime-suspended in the D3hot state, or could at least be configured to exclude that source.

What did you see instead?

Each metrics pull wakes up the GPU, which then goes straight back to sleep because no monitor is attached, flooding the dmesg log. (Each wakeup produces 25 lines of amdgpu log output, and in my case Prometheus pulls every 8 to 10 seconds, just long enough for the GPU to go back to sleep in between.) See also the finding on amd-gfx and dri-devel.

Right now, without modifying the source, I can only either (1) disable the hwmon collector altogether, or (2) modprobe amdgpu with runpm=0, which wastes some power. If this use case (running node_exporter on machines that have GPUs but are seldom used via a GUI) is worth supporting, it would be useful to allow excluding certain hwmon sources by name (e.g. amdgpu) or device ID (e.g. hwmon1).
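To make the name-based or ID-based exclusion concrete, here is a minimal sketch of the filtering logic. It is written in Python purely for illustration (node_exporter itself is Go) and is not how the collector is implemented. The function name `select_hwmon_chips` and the idea of a single exclude regex are assumptions of this sketch, not existing node_exporter options. The only sysfs detail it relies on is that each `/sys/class/hwmon/hwmonN` directory has a `name` file.

```python
import re
from pathlib import Path


def select_hwmon_chips(root, exclude_regex=None):
    """Return (hwmon_id, chip_name) pairs that are NOT excluded.

    root          -- directory containing hwmonN entries (e.g. /sys/class/hwmon)
    exclude_regex -- hypothetical setting; a chip is skipped when the regex
                     fully matches its name (e.g. "amdgpu") or its hwmon ID
                     (e.g. "hwmon1")
    """
    pattern = re.compile(exclude_regex) if exclude_regex else None
    selected = []
    for chip_dir in sorted(Path(root).glob("hwmon*")):
        try:
            # Reading only the "name" file; no sensor attributes are touched.
            chip_name = (chip_dir / "name").read_text().strip()
        except OSError:
            chip_name = ""
        if pattern and (
            pattern.fullmatch(chip_name) or pattern.fullmatch(chip_dir.name)
        ):
            continue  # excluded: never read this chip's sensor files
        selected.append((chip_dir.name, chip_name))
    return selected
```

With an exclude regex of `amdgpu`, a caller would iterate only over the remaining chips, so the GPU's sensor files would never be opened during a scrape.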

Adding awareness of devices' power states could be useful too, but since not all device classes provide this information in their sysfs entries, it might or might not be easy. PCI devices all have power_state, though, so we could choose to skip collection when the power state is not D0, i.e. not fully powered on.
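The power-state check could look roughly like the sketch below, again in Python for illustration only. The helper name `is_fully_powered` is hypothetical, and treating a missing `power_state` file as "go ahead and collect" is an assumption of this sketch, chosen so that non-PCI devices keep their current behaviour. The path to pass in would be the device directory behind the hwmon entry (for example the target of `/sys/class/hwmon/hwmon1/device`).

```python
from pathlib import Path


def is_fully_powered(device_dir):
    """Return True if the device reports D0, or reports no power state at all.

    device_dir -- sysfs directory of the underlying device; for PCI devices
                  it is expected to contain a "power_state" file
    """
    try:
        state = (Path(device_dir) / "power_state").read_text().strip()
    except OSError:
        # Assumption: no power_state attribute means we cannot tell,
        # so fall back to collecting as before.
        return True
    return state == "D0"
```

A collector using this would skip a chip whose device reports, say, `D3hot`, and read its sensors only once the device is back in `D0`.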
