
Add prometheus throttle#283

Merged
Tzvonimir merged 4 commits into main from tzvonimir/cpu-events on Feb 18, 2026
@Tzvonimir Tzvonimir commented Feb 18, 2026


Summary by Gitar

  • New metric field:
    • Added cpu_throttle_fraction (field 18) to ContainerMetricItem proto for tracking CPU CFS throttle ratios (0.0-1.0)
  • Prometheus collection:
    • Implemented collectContainerCPUThrottleMetrics() querying container_cpu_cfs_throttled_periods_total and container_cpu_cfs_periods_total with 5-minute rate windows
  • MPA streaming integration:
    • Updated mpa_server.go Broadcast() to include CpuThrottleFraction in real-time metric streams to connected clients
  • Performance monitoring:
    • Enables tracking container CPU throttling behavior for multi-dimensional pod autoscaling decisions


Comment thread internal/collector/container_resource_collector.go Outdated
Comment thread internal/collector/container_resource_collector.go
@Tzvonimir Tzvonimir force-pushed the tzvonimir/cpu-events branch from f759e54 to d5f5074 Compare February 18, 2026 18:16
Comment on lines +737 to +743
throttledResult, _, err := c.prometheusAPI.Query(ctx, throttledQuery, queryTime)
if err != nil {
    return 0, fmt.Errorf("error querying throttled periods: %w", err)
}

// Query total periods rate
totalResult, _, err := c.prometheusAPI.Query(ctx, totalQuery, queryTime)

💡 Performance: Two separate Prometheus queries could be combined into one

The collectContainerCPUThrottleMetrics function issues two separate Prometheus queries — one for container_cpu_cfs_throttled_periods_total and one for container_cpu_cfs_periods_total. In clusters with many containers, this doubles the query load for this feature (2 extra queries × N containers per collection cycle).

These can be combined into a single PromQL expression that computes the fraction server-side:

sum(rate(container_cpu_cfs_throttled_periods_total{...}[5m])) / sum(rate(container_cpu_cfs_periods_total{...}[5m]))

When both rates are zero, the division is 0/0, which PromQL evaluates to NaN; you already handle that case with the guard at lines 763-766. This halves the Prometheus query load for this metric.
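
A minimal sketch of how the combined expression could be assembled, using only the standard library. The helper name `buildThrottleFractionQuery` and the example selector are hypothetical and not taken from the PR; the label selector is a parameter because the comment elides it as `{...}`.

```go
package main

import "fmt"

// buildThrottleFractionQuery returns a single PromQL expression that divides
// the throttled-period rate by the total-period rate on the Prometheus side.
// selector is the label matcher list placed inside the braces; window is the
// rate range (the PR summary mentions 5-minute windows).
// Hypothetical helper for illustration, not the PR's code.
func buildThrottleFractionQuery(selector, window string) string {
	return fmt.Sprintf(
		"sum(rate(container_cpu_cfs_throttled_periods_total{%s}[%s])) / sum(rate(container_cpu_cfs_periods_total{%s}[%s]))",
		selector, window, selector, window,
	)
}

func main() {
	// The selector below is an assumed example, chosen only to show the output shape.
	fmt.Println(buildThrottleFractionQuery(`container="app"`, "5m"))
}
```

With the division done server-side, the client receives only the quotient, so a check such as `totalRate <= 0` no longer has anything to inspect; the returned value itself would need the NaN/Inf check.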

gitar-bot Bot commented Feb 18, 2026

Code Review 👍 Approved with suggestions 1 resolved / 3 findings

Clean implementation of CPU throttle metrics with good NaN/Inf handling. The previous finding about missing DisableNetworkIOMetrics gating remains unresolved, and the two Prometheus queries per container could be combined into one for efficiency.

💡 Performance: Two separate Prometheus queries could be combined into one

📄 internal/collector/container_resource_collector.go:737-743

The collectContainerCPUThrottleMetrics function issues two separate Prometheus queries — one for container_cpu_cfs_throttled_periods_total and one for container_cpu_cfs_periods_total. In clusters with many containers, this doubles the query load for this feature (2 extra queries × N containers per collection cycle).

These can be combined into a single PromQL expression that computes the fraction server-side:

sum(rate(container_cpu_cfs_throttled_periods_total{...}[5m])) / sum(rate(container_cpu_cfs_periods_total{...}[5m]))

When both rates are zero, the division is 0/0, which PromQL evaluates to NaN; you already handle that case with the guard at lines 763-766. This halves the Prometheus query load for this metric.

💡 Quality: CPU throttle queries not gated by DisableNetworkIOMetrics flag

📄 internal/collector/container_resource_collector.go:370-380

The CPU throttle metric collection at lines 370-380 runs whenever prometheusAPI != nil. Unlike network metrics (line 338) and IO metrics (line 384), it is not gated by !c.config.DisableNetworkIOMetrics.

Currently this doesn't cause a problem because the Prometheus client is only initialized when both !DisableNetworkIOMetrics && !DisableGPUMetrics (line 164). However, if the Prometheus initialization logic is ever relaxed (e.g., to support GPU-only or throttle-only queries), throttle queries would unexpectedly fire even when the operator intended to disable Prometheus-based metrics.

Consider either:

  1. Adding a dedicated DisableCPUThrottleMetrics config flag, or
  2. Gating it behind !c.config.DisableNetworkIOMetrics for consistency with other Prometheus-sourced metrics, or
  3. Adding a comment explaining why it's intentionally ungated.
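
A minimal sketch of what the gating could look like if options 1 and 2 were combined. The `Config` struct here is a stand-in that mirrors only the flags named in this finding, `DisableCPUThrottleMetrics` is the hypothetical flag from option 1, and `shouldCollectCPUThrottle` is an assumed helper name, not code from the PR.

```go
package main

import "fmt"

// Config is a stand-in holding only the flags discussed in this finding.
// DisableCPUThrottleMetrics does not exist in the PR; it is the flag
// proposed in option 1.
type Config struct {
	DisableNetworkIOMetrics   bool
	DisableCPUThrottleMetrics bool
}

// shouldCollectCPUThrottle reports whether throttle queries should run.
// It requires a Prometheus client and honours both the existing
// Prometheus-related flag and the hypothetical dedicated flag.
func shouldCollectCPUThrottle(cfg Config, prometheusAvailable bool) bool {
	return prometheusAvailable &&
		!cfg.DisableNetworkIOMetrics &&
		!cfg.DisableCPUThrottleMetrics
}

func main() {
	fmt.Println(shouldCollectCPUThrottle(Config{}, true))
	fmt.Println(shouldCollectCPUThrottle(Config{DisableNetworkIOMetrics: true}, true))
	fmt.Println(shouldCollectCPUThrottle(Config{DisableCPUThrottleMetrics: true}, true))
}
```

Keeping the decision in one predicate means a later change to the Prometheus initialization logic would not silently enable throttle queries.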

✅ 1 resolved

Edge Case: NaN from Prometheus bypasses clamping, propagates into gRPC

📄 internal/collector/container_resource_collector.go:762-774

Prometheus rate() can return NaN (e.g., when a counter is missing, resets, or has insufficient data points in the window). When vector[0].Value is NaN, the float64 conversion preserves it. The issue is that IEEE 754 NaN comparisons are always false:

  • totalRate <= 0 → false (NaN is not ≤ 0), so division proceeds
  • throttledRate / totalRate where either is NaN → NaN
  • fraction < 0 → false, fraction > 1 → false — clamping is completely bypassed

The resulting NaN value propagates into CpuThrottledFraction, gets serialized into the protobuf double cpu_throttle_fraction field, and is sent over gRPC. Downstream consumers (Dakr autoscaler) may not handle NaN correctly, potentially causing incorrect scaling decisions or crashes.

Fix: Add math.IsNaN / math.IsInf guards before the arithmetic:

import "math"

// Guard against NaN/Inf from Prometheus and missing CFS data
if math.IsNaN(throttledRate) || math.IsInf(throttledRate, 0) ||
    math.IsNaN(totalRate) || math.IsInf(totalRate, 0) || totalRate <= 0 {
    return 0, nil
}

fraction := throttledRate / totalRate
if fraction < 0 || math.IsNaN(fraction) {
    fraction = 0
}
if fraction > 1 {
    fraction = 1
}

@Tzvonimir Tzvonimir merged commit 44d1ef9 into main Feb 18, 2026
25 of 26 checks passed
@Tzvonimir Tzvonimir deleted the tzvonimir/cpu-events branch February 18, 2026 21:45
Parthiba-Hazra pushed a commit that referenced this pull request May 5, 2026
* Add prometheus throttle

* Fix up math issue

* Update proto

* Update readme