DCGM agent integration for real GPU utilization #7

@maksimov

Optionally collect real GPU utilization metrics via NVIDIA DCGM.

Currently, EC2 GPU utilization is inferred from CPU and network proxy signals.
DCGM reports actual GPU compute and memory utilization, which eliminates
false positives on instances with high GPU utilization but low CPU usage.

  • Query DCGM exporter Prometheus endpoint if available
  • Fall back to proxy signals when DCGM is not present
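
A minimal Python sketch of the two bullets above, not the project's actual implementation. It assumes dcgm-exporter is reachable on its commonly used default port 9400 at /metrics and that the DCGM_FI_DEV_GPU_UTIL gauge is enabled. The function names and the proxy_estimate callback are hypothetical, introduced only for illustration.

```python
import http.client
import re
import urllib.request
from typing import Callable, List, Optional, Tuple

# Assumption: dcgm-exporter listens on port 9400 and serves /metrics.
DCGM_URL_TEMPLATE = "http://{host}:9400/metrics"
# Assumption: the GPU utilization gauge is enabled in the exporter config.
GPU_UTIL_METRIC = "DCGM_FI_DEV_GPU_UTIL"

# Matches one Prometheus text-format sample: name, optional {labels}, value.
_SAMPLE_RE = re.compile(
    r"^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{.*\})?\s+(?P<value>\S+)"
)


def parse_gpu_util(metrics_text: str) -> List[float]:
    """Return one GPU utilization value (percent) per sample found."""
    values: List[float] = []
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comments
            continue
        match = _SAMPLE_RE.match(line)
        if match and match.group("name") == GPU_UTIL_METRIC:
            try:
                values.append(float(match.group("value")))
            except ValueError:
                continue  # ignore malformed sample values
    return values


def fetch_dcgm_gpu_util(host: str, timeout: float = 2.0) -> Optional[float]:
    """Mean GPU utilization across GPUs, or None if DCGM is unavailable."""
    url = DCGM_URL_TEMPLATE.format(host=host)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            text = resp.read().decode("utf-8", errors="replace")
    except (OSError, http.client.HTTPException):
        return None  # endpoint missing, refused, or timed out
    values = parse_gpu_util(text)
    if not values:
        return None  # exporter is up but exposes no GPU utilization samples
    return sum(values) / len(values)


def gpu_utilization(
    host: str,
    proxy_estimate: Callable[[], float],
    fetch: Callable[[str], Optional[float]] = fetch_dcgm_gpu_util,
) -> Tuple[float, str]:
    """Prefer the DCGM measurement; fall back to the proxy-signal estimate."""
    measured = fetch(host)
    if measured is not None:
        return measured, "dcgm"
    return proxy_estimate(), "proxy"
```

Returning the source ("dcgm" or "proxy") alongside the value lets downstream reporting distinguish measured utilization from inferred utilization. Averaging across GPUs is only one possible aggregation; per-GPU values may be more useful on multi-GPU instances.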

Labels: enhancement (New feature or request), v0.2 (Version 0.2 milestone)