
Changing CPU overprovisioning factor breaks prometheus and listHosts usage metrics  #7591

@phsm

ISSUE TYPE
  • Bug Report
COMPONENT NAME
Prometheus Exporter
API
CLOUDSTACK VERSION
4.17.0
4.18.0
CONFIGURATION

N/A

OS / ENVIRONMENT

N/A

SUMMARY

When the CPU overcommit factor is changed, the Prometheus exporter metric "cloudstack_host_cpu_usage_mhz_total" and the API response of listHosts (cpuused field) appear to be multiplied by the new overcommit factor.

The actual "used" metrics should not be affected by the overcommit factor. The overcommit factor should only virtually increase the capacity of the node; it should not affect the usage metric.

STEPS TO REPRODUCE
1. Empty a hypervisor of VMs, VRs, system VMs, etc., so that no virtual machines are running on it.
2. Pick a virtual machine to start on that hypervisor. Before starting it, note its number of CPU cores and the CPU MHz of each, e.g. two cores at 500 MHz each.
3. After starting the test virtual machine on the test hypervisor, check the Prometheus metric cloudstack_host_cpu_usage_mhz_total{hostname=<your test hypervisor metric>}. It should show the CPU MHz used on that hypervisor: cpu_number * cpu_mhz, e.g. 1000. This is the correct value.
4. Now change the cluster setting cpu.overprovisioning.factor to a new value, e.g. 4.
5. The metric cloudstack_host_cpu_usage_mhz_total{hostname=<your test hypervisor metric>} now shows a different value, presumably calculated by the formula: cpu_number * cpu_mhz * (new_overprovisioning_factor - old_overprovisioning_factor)
6. If you stop and start the test VM, cloudstack_host_cpu_usage_mhz_total goes back to the correct value.
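
To make the arithmetic in steps 3 and 5 concrete, here is a minimal Python sketch. It is not CloudStack code: the function names are made up for illustration, the second formula is only my guess from step 5, and the old factor of 1 is an assumption (the steps above do not state the starting value).

```python
def expected_used_mhz(cpu_number, cpu_mhz):
    """Usage the metric should report: cores times MHz per core, no factor applied."""
    return cpu_number * cpu_mhz


def presumed_reported_mhz(cpu_number, cpu_mhz, old_factor, new_factor):
    """Value presumably reported after the factor change (the guess from step 5)."""
    return cpu_number * cpu_mhz * (new_factor - old_factor)


# Test VM from step 2: two cores at 500 MHz each.
correct = expected_used_mhz(2, 500)

# Step 4 sets the factor to 4; an old factor of 1 is assumed here.
distorted = presumed_reported_mhz(2, 500, old_factor=1, new_factor=4)

print(correct)    # 1000
print(distorted)  # 3000
```

If the guessed formula is right, the reported value depends on both the old and the new factor, which would explain why a stop/start of the VM resets it.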

The same reproduction steps apply to the API response of the listHosts call, field cpuused.
If you start a VM and then change the overprovisioning factor, the field will contain an incorrect value (especially if you set a ridiculously high overprovisioning factor, such as 1000).
EXPECTED RESULTS
The Prometheus metric cloudstack_host_cpu_usage_mhz_total and the API response of the listHosts call (field cpuused) should not include the overprovisioning factor in their calculation, because usage metrics are meant to report real usage.
ACTUAL RESULTS
The metric is reported without the overprovisioning factor in its calculation when a VM starts, but becomes distorted when you change the overprovisioning factor.
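
For anyone who wants to check their own hosts, below is a rough Python sketch that reads the metric out of a Prometheus text-format scrape and flags hosts whose value differs from the expected cpu_number * cpu_mhz sum. The sample scrape text and host names are hypothetical, and the real exporter output may carry additional labels; the only things taken from this report are the metric name and the hostname label.

```python
import re

METRIC = "cloudstack_host_cpu_usage_mhz_total"

# Hypothetical scrape output for illustration only.
SAMPLE_SCRAPE = """\
cloudstack_host_cpu_usage_mhz_total{hostname="kvm-test-01"} 3000.0
cloudstack_host_cpu_usage_mhz_total{hostname="kvm-test-02"} 1000.0
"""

# Matches: metric_name{labels} value
LINE_RE = re.compile(r"^" + re.escape(METRIC) + r"\{([^}]*)\}\s+(\S+)")
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def used_mhz_by_host(scrape_text):
    """Return {hostname: reported MHz} for every sample of the metric."""
    result = {}
    for line in scrape_text.splitlines():
        match = LINE_RE.match(line)
        if match is None:
            continue  # comment line or a different metric
        labels = dict(LABEL_RE.findall(match.group(1)))
        host = labels.get("hostname")
        if host is not None:
            result[host] = float(match.group(2))
    return result


def hosts_with_unexpected_usage(scrape_text, expected_mhz):
    """Return {hostname: reported MHz} where the report differs from expected_mhz."""
    reported = used_mhz_by_host(scrape_text)
    return {
        host: value
        for host, value in reported.items()
        if host in expected_mhz and value != expected_mhz[host]
    }


# Each test host runs one VM with two cores at 500 MHz, so 1000 is expected.
expected = {"kvm-test-01": 1000.0, "kvm-test-02": 1000.0}
print(hosts_with_unexpected_usage(SAMPLE_SCRAPE, expected))
```

The expected values have to be computed by hand from the VMs running on each host, since the point of this report is that the reported value cannot be trusted after a factor change.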

Status

Dev In Progress
