Add initial support for monitoring GPUs on Linux #1998
SuperQ merged 1 commit into prometheus:master from siavashs:drm
Sample metrics from a Vega64:
# HELP node_drm_card_info Card information
# TYPE node_drm_card_info gauge
node_drm_card_info{card="card0",memory_vendor="samsung",power_performance_level="manual",unique_id="1234567890",vendor="amd"} 1
# HELP node_drm_gpu_busy_percent How busy the GPU is as a percentage.
# TYPE node_drm_gpu_busy_percent gauge
node_drm_gpu_busy_percent{card="card0"} 10
# HELP node_drm_memory_gtt_size_bytes The size of the graphics translation table (GTT) block in bytes.
# TYPE node_drm_memory_gtt_size_bytes gauge
node_drm_memory_gtt_size_bytes{card="card0"} 8.573157376e+09
# HELP node_drm_memory_gtt_used_bytes The used amount of the graphics translation table (GTT) block in bytes.
# TYPE node_drm_memory_gtt_used_bytes gauge
node_drm_memory_gtt_used_bytes{card="card0"} 1.48447232e+08
# HELP node_drm_memory_vis_vram_size_bytes The size of visible VRAM in bytes.
# TYPE node_drm_memory_vis_vram_size_bytes gauge
node_drm_memory_vis_vram_size_bytes{card="card0"} 2.68435456e+08
# HELP node_drm_memory_vis_vram_used_bytes The used amount of visible VRAM in bytes.
# TYPE node_drm_memory_vis_vram_used_bytes gauge
node_drm_memory_vis_vram_used_bytes{card="card0"} 1.13287168e+08
# HELP node_drm_memory_vram_size_bytes The size of VRAM in bytes.
# TYPE node_drm_memory_vram_size_bytes gauge
node_drm_memory_vram_size_bytes{card="card0"} 8.573157376e+09
# HELP node_drm_memory_vram_used_bytes The used amount of VRAM in bytes.
# TYPE node_drm_memory_vram_used_bytes gauge
node_drm_memory_vram_used_bytes{card="card0"} 1.773531136e+09
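As a rough illustration of how the `*_size_bytes` and `*_used_bytes` pairs above relate, the following sketch computes a utilisation ratio from the sample values. The helper name `usedRatio` is hypothetical and not part of node_exporter; the numbers are copied from the Vega64 sample.

```go
package main

import "fmt"

// usedRatio returns used/size, or 0 when size is 0 to avoid dividing by zero.
// Hypothetical helper for illustration only.
func usedRatio(used, size float64) float64 {
	if size == 0 {
		return 0
	}
	return used / size
}

func main() {
	// Values copied from the Vega64 sample metrics.
	vramUsed, vramSize := 1.773531136e+09, 8.573157376e+09
	gttUsed, gttSize := 1.48447232e+08, 8.573157376e+09

	fmt.Printf("VRAM used: %.1f%%\n", 100*usedRatio(vramUsed, vramSize))
	fmt.Printf("GTT used:  %.1f%%\n", 100*usedRatio(gttUsed, gttSize))
}
```

For the sample card this works out to roughly a fifth of VRAM in use. In practice the same ratio would more likely be computed on the Prometheus side from the two gauges rather than in Go.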
FreeBSD supports the same Linux driver but I'm not sure if it exposes the DRM information through sysfs.

Nice! But we should move the parsing to https://github.com/prometheus/procfs - Can you submit a PR there? That'd be great!

@discordianfish prometheus/procfs#370

@SuperQ when will the next

@discordianfish this is refactored and ready for review 😄

I think the only required change would be to set the
Expose GPU metrics using `sysfs/drm`. `amdgpu` is the only driver which exposes this information through DRM. Signed-off-by: Siavash Safi <siavash.safi@gmail.com>
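To make the `sysfs/drm` approach concrete, here is a minimal sketch of reading numeric attributes for each card directly from sysfs. It is not the code from this PR or from prometheus/procfs: the helper names are hypothetical, and the path layout (`/sys/class/drm/cardN/device/`) and attribute file names (`gpu_busy_percent`, `mem_info_vram_total`, `mem_info_vram_used`) are assumptions based on what the `amdgpu` driver is understood to expose.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// parseUintAttr converts the raw contents of a sysfs attribute file
// (a decimal number, usually followed by a newline) into a uint64.
func parseUintAttr(raw string) (uint64, error) {
	return strconv.ParseUint(strings.TrimSpace(raw), 10, 64)
}

// readUintAttr reads one attribute file below a card's device directory.
func readUintAttr(deviceDir, name string) (uint64, error) {
	b, err := os.ReadFile(filepath.Join(deviceDir, name))
	if err != nil {
		return 0, err
	}
	return parseUintAttr(string(b))
}

func main() {
	// Assumed layout: /sys/class/drm/cardN/device/<attribute>.
	dirs, err := filepath.Glob("/sys/class/drm/card[0-9]*/device")
	if err != nil {
		fmt.Fprintln(os.Stderr, "bad glob pattern:", err)
		return
	}
	// Attribute names are assumptions about the amdgpu driver.
	attrs := []string{"gpu_busy_percent", "mem_info_vram_total", "mem_info_vram_used"}
	for _, dir := range dirs {
		card := filepath.Base(filepath.Dir(dir))
		for _, attr := range attrs {
			v, err := readUintAttr(dir, attr)
			if err != nil {
				// Other drivers and connector entries lack these files; skip them.
				continue
			}
			fmt.Printf("%s %s=%d\n", card, attr, v)
		}
	}
}
```

On a machine without an `amdgpu` card the program simply prints nothing, since every read fails and is skipped.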
Any update on this?

@SuperQ Can you take a look?
NOTE: In order to support globs in the textfile collector path, filenames exposed by
`node_textfile_mtime_seconds` now contain the full path name.
* [CHANGE] Add path label to rapl collector #2146
* [FEATURE] Add support for monitoring GPUs on Linux #1998
* [FEATURE] Add os release collector #2094
* [FEATURE] Add netdev.address-info collector #2105
* [ENHANCEMENT] Support glob textfile collector directories #1985
* [ENHANCEMENT] ethtool: Expose node_ethtool_info metric #2080
* [ENHANCEMENT] Use include/exclude flags for ethtool filtering #2165
* [ENHANCEMENT] Add flag to disable guest CPU metrics #2123
* [BUGFIX] ethtool: Sanitize metric names #2093
* [BUGFIX] Fix ethtool collector for multiple interfaces #2126
* [BUGFIX] Fix possible panic on macOS #2133
* [BUGFIX] Collect flag_info and bug_info only for one core #2156
Signed-off-by: Ben Kochie <superq@gmail.com>
NOTE: In order to support globs in the textfile collector path, filenames exposed by
`node_textfile_mtime_seconds` now contain the full path name.
* [CHANGE] Add path label to rapl collector #2146
* [FEATURE] Add support for monitoring GPUs on Linux #1998
* [FEATURE] Add os release collector #2094
* [FEATURE] Add netdev.address-info collector #2105
* [ENHANCEMENT] Support glob textfile collector directories #1985
* [ENHANCEMENT] ethtool: Expose node_ethtool_info metric #2080
* [ENHANCEMENT] Use include/exclude flags for ethtool filtering #2165
* [ENHANCEMENT] Add flag to disable guest CPU metrics #2123
* [ENHANCEMENT] Add threads metrics to processes collector #2164
* [BUGFIX] ethtool: Sanitize metric names #2093
* [BUGFIX] Fix ethtool collector for multiple interfaces #2126
* [BUGFIX] Fix possible panic on macOS #2133
* [BUGFIX] Collect flag_info and bug_info only for one core #2156
Signed-off-by: Ben Kochie <superq@gmail.com>
NOTE: In order to support globs in the textfile collector path, filenames exposed by
`node_textfile_mtime_seconds` now contain the full path name.
* [CHANGE] Add path label to rapl collector #2146
* [CHANGE] Exclude filesystems under /run/credentials #2157
* [FEATURE] Add support for monitoring GPUs on Linux #1998
* [FEATURE] Add Darwin thermal collector #2032
* [FEATURE] Add os release collector #2094
* [FEATURE] Add netdev.address-info collector #2105
* [ENHANCEMENT] Support glob textfile collector directories #1985
* [ENHANCEMENT] ethtool: Expose node_ethtool_info metric #2080
* [ENHANCEMENT] Use include/exclude flags for ethtool filtering #2165
* [ENHANCEMENT] Add flag to disable guest CPU metrics #2123
* [ENHANCEMENT] Add DMI collector #2131
* [ENHANCEMENT] Add threads metrics to processes collector #2164
* [ENHANCEMENT] Reduce timer GC delays in the Linux filesystem collector #2169
* [BUGFIX] ethtool: Sanitize metric names #2093
* [BUGFIX] Fix ethtool collector for multiple interfaces #2126
* [BUGFIX] Fix possible panic on macOS #2133
* [BUGFIX] Collect flag_info and bug_info only for one core #2156
Signed-off-by: Ben Kochie <superq@gmail.com>
NOTE: In order to support globs in the textfile collector path, filenames exposed by
`node_textfile_mtime_seconds` now contain the full path name.
* [CHANGE] Add path label to rapl collector #2146
* [CHANGE] Exclude filesystems under /run/credentials #2157
* [FEATURE] Add darwin powersupply collector #1777
* [FEATURE] Add support for monitoring GPUs on Linux #1998
* [FEATURE] Add Darwin thermal collector #2032
* [FEATURE] Add os release collector #2094
* [FEATURE] Add netdev.address-info collector #2105
* [ENHANCEMENT] Support glob textfile collector directories #1985
* [ENHANCEMENT] ethtool: Expose node_ethtool_info metric #2080
* [ENHANCEMENT] Use include/exclude flags for ethtool filtering #2165
* [ENHANCEMENT] Add flag to disable guest CPU metrics #2123
* [ENHANCEMENT] Add DMI collector #2131
* [ENHANCEMENT] Add threads metrics to processes collector #2164
* [ENHANCEMENT] Reduce timer GC delays in the Linux filesystem collector #2169
* [BUGFIX] ethtool: Sanitize metric names #2093
* [BUGFIX] Fix ethtool collector for multiple interfaces #2126
* [BUGFIX] Fix possible panic on macOS #2133
* [BUGFIX] Collect flag_info and bug_info only for one core #2156
Signed-off-by: Ben Kochie <superq@gmail.com>
NOTE: In order to support globs in the textfile collector path, filenames exposed by
`node_textfile_mtime_seconds` now contain the full path name.
* [CHANGE] Add path label to rapl collector #2146
* [CHANGE] Exclude filesystems under /run/credentials #2157
* [FEATURE] Add lnstat collector for metrics from /proc/net/stat/ #1771
* [FEATURE] Add darwin powersupply collector #1777
* [FEATURE] Add support for monitoring GPUs on Linux #1998
* [FEATURE] Add Darwin thermal collector #2032
* [FEATURE] Add os release collector #2094
* [FEATURE] Add netdev.address-info collector #2105
* [ENHANCEMENT] Support glob textfile collector directories #1985
* [ENHANCEMENT] ethtool: Expose node_ethtool_info metric #2080
* [ENHANCEMENT] Use include/exclude flags for ethtool filtering #2165
* [ENHANCEMENT] Add flag to disable guest CPU metrics #2123
* [ENHANCEMENT] Add DMI collector #2131
* [ENHANCEMENT] Add threads metrics to processes collector #2164
* [ENHANCEMENT] Reduce timer GC delays in the Linux filesystem collector #2169
* [BUGFIX] ethtool: Sanitize metric names #2093
* [BUGFIX] Fix ethtool collector for multiple interfaces #2126
* [BUGFIX] Fix possible panic on macOS #2133
* [BUGFIX] Collect flag_info and bug_info only for one core #2156
Signed-off-by: Ben Kochie <superq@gmail.com>