Skip to content

Biday prometheus#375

Closed
Parthiba-Hazra wants to merge 36 commits into
mainfrom
ph/cut-the-prom-dep
Closed

Biday prometheus#375
Parthiba-Hazra wants to merge 36 commits into
mainfrom
ph/cut-the-prom-dep

Conversation

@Parthiba-Hazra
Copy link
Copy Markdown
Collaborator

@Parthiba-Hazra Parthiba-Hazra commented May 5, 2026

[Title]

📚 Description of Changes

Provide an overview of your changes and why they’re needed. Link to any related issues (e.g., "Fixes #123"). If your PR fixes a bug, resolves a feature request, or updates documentation, please explain how.

  • What Changed:
    (Describe the modifications, additions, or removals.)

  • Why This Change:
    (Explain the problem this PR addresses or the improvement it provides.)

  • Affected Components:
    (Which component does this change affect? - put x for all components)

  • Compose

  • K8s

  • Other (please specify)

❓ Motivation and Context

Why is this change required? What problem does it solve?

  • Context:
    (Provide background information or link to related discussions/issues.)

  • Relevant Tasks/Issues:
    (e.g., Fixes: #GitHub Issue)

🔍 Types of Changes

Indicate which type of changes your code introduces (check all that apply):

  • BUGFIX: Non-breaking fix for an issue.
  • NEW FEATURE: Non-breaking addition of functionality.
  • BREAKING CHANGE: Fix or feature that causes existing functionality to not work as expected.
  • ENHANCEMENT: Improvement to existing functionality.
  • CHORE: Changes that do not affect production (e.g., documentation, build tooling, CI).

🔬 QA / Verification Steps

Describe the steps a reviewer should take to verify your changes:

  1. (Step one: e.g., "Run make test to verify all tests pass.")
  2. (Step two: e.g., "Deploy to a Kind cluster with make create-kind && make deploy.")
  3. (Additional steps as needed.)

✅ Global Checklist

Please check all boxes that apply:

  • I have read and followed the CONTRIBUTING guidelines.
  • My code follows the code style of this project.
  • I have updated the documentation as needed.
  • I have added tests that cover my changes.
  • All new and existing tests have passed locally.
  • I have run this code in a local environment to verify functionality.
  • I have considered the security implications of this change.

Summary by Gitar

  • Removed Prometheus dependency:
    • Fully decommissioned Prometheus server, node-exporter, and kube-state-metrics components.
    • Simplified collectors by removing Prometheus-reliant fallbacks and query logic.
  • Unified metric collection:
    • Extended nodemon to aggregate node, container, and PVC metrics via stats/summary and /metrics/cadvisor endpoints.
    • Implemented a unified HTTP API for local metric consumption, replacing Prometheus as the primary data source.
  • Historical data refactoring:
    • Replaced HistoricalMetricsCollector with HistoricalPercentileCache, fetching pre-computed workload percentiles directly from DAKR control plane.
  • New internal services:
    • Introduced CAdvisorScraper for rate computation and PercentileFetcher for DAKR-backed historical metrics.
    • Added comprehensive integration testing for the new unified collector pipeline.

This will update automatically on new commits.

Parthiba-Hazra and others added 30 commits April 28, 2026 16:46
Covers data source replacement map (kubelet stats/summary + cAdvisor),
nodemon DaemonSet extension, historical percentile caching from DAKR
ClickHouse, and phased migration strategy with feature flags.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tion logic

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…sion

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implement StatsPoller to fetch and parse the kubelet /stats/summary
endpoint, providing CPU, memory, network, and PVC volume metrics for
all pods on a node without a Prometheus dependency.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…e, disk I/O, network

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…/metrics endpoints

Introduce three new HTTP handlers backed by a UnifiedQuerier interface that serves
merged container metrics (CPU, memory, network, disk, throttle, GPU), node-level
aggregated rates, and PVC storage metrics as JSON. All endpoints support GET-only
with query-param filtering; tests cover JSON encoding, per-field values, filtering,
404 (no node data), and 405 (wrong method).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ic fetchers

Add UnifiedContainerMetric, UnifiedNodeMetric, and UnifiedPVCMetric types
mirroring nodemon's new unified endpoints, plus fetch*/FetchAll* methods for
/v2/container/metrics, /node/metrics, and /pvc/metrics with httptest coverage.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds HistoricalPercentileProvider interface to internal/collector/interface.go
and updates MpaServer to depend on the interface rather than the concrete
*HistoricalMetricsCollector, enabling future substitution with a DAKR-backed cache.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implements HistoricalPercentileProvider with an in-memory cache that is
populated from a PercentileFetcher (DAKR control plane) on a 15-minute
ticker, so HPA recommendations no longer require a local Prometheus.
Stale data is preserved on fetch errors and health is reported as
Degraded. All five TDD test cases pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…e flag

When EnableNodemonMetrics is true, the collector pre-fetches all container
metrics from nodemon once per cycle and uses them for network, disk I/O,
and CPU throttle metrics instead of querying Prometheus. Falls back to
Prometheus if the nodemon fetch fails.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…r setup

Pass PolicyConfig to setupMpaServer so EnableNodemonMetrics is accessible;
establishes the historicalProvider wiring point for when the DAKR fetcher is available.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…hind feature flag

- Add EnableNodemonMetrics field to NodeCollectorConfig and PersistentVolumeClaimMetricsCollectorConfig
- Add useNodemon bool and nodemonClient fields to both collector structs
- Implement collectNodeNetworkIOMetricsFromNodemon mapping UnifiedNodeMetric fields to Prometheus key names
- Extend UnifiedNodeMetric with packet/error/drop rate fields to support full network IO mapping
- Implement getFilesystemUsageFromNodemon using FetchAllPVCMetrics with namespace+name matching
- Add nodemon branch guard at top of collectNodeNetworkIOMetrics and getFilesystemUsageFromPrometheus
- Initialize nodemon client in PVCMetricsCollector.Start when useNodemon is set; skip Prometheus init
- Wire EnableNodemonMetrics flag through all 4 NodeCollector and 3 PVCMetricsCollector creation sites in the controller

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds TestIntegration_FullMetricsFlow to exercise the combined
StatsPoller + CAdvisorScraper pipeline end-to-end using httptest mock
servers — verifying CPU/memory/network from stats/summary and throttle,
disk-I/O, and packet rates from cAdvisor across two scrape cycles.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…e percentile retrieval

Wire the DAKR K8SService.GetWorkloadContainerPercentiles RPC into zxporter
as a collector.PercentileFetcher implementation, replacing the Prometheus
dependency for historical workload metrics in the MPA server path.

- Add K8SServiceClient to RealDakrClient and initialize in NewDakrClient
- Create DakrPercentileFetcher with rate-limited concurrent RPC calls
- Map ContainerPercentileSummary (float64 cores/bytes) to
  ContainerHistoricalMetrics (int64 millicores/bytes)
- Add NewPercentileFetcher() to DakrClient interface and both implementations
- Wire fetcher into setupMpaServer when EnableNodemonMetrics is true
- Add comprehensive tests for mapping, error handling, and nil safety

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…Metrics.enabled=true

When nodemonMetrics.enabled=true in values.yaml, the Helm chart no longer
deploys Prometheus server, kube-state-metrics, or node-exporter. The nodemon
DaemonSet collects metrics directly from kubelet, making these components
unnecessary. PROMETHEUS_URL is omitted from the ConfigMap and
ENABLE_NODEMON_METRICS is set to "true" so the application can detect the mode.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…and cAdvisor scraper

Add UnifiedExporter that combines stats/summary, cAdvisor, and GPU data
into a single collection loop. Register /v2/container/metrics,
/node/metrics, /pvc/metrics on the HTTP mux. Uses K8s API server proxy
for kubelet access with proper auth.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
NodeCollector and ContainerResourceCollector were still trying to
connect to Prometheus on startup even with useNodemon=true. Gate
the initPrometheusClient call behind !c.useNodemon.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tch when nodemon enabled

- Add disableGPUMetrics value to Helm chart and wire into configmap
- Skip legacy /container/metrics GPU fetch when useNodemon=true (GPU
  data comes from unified /v2/container/metrics instead)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ContainerResourceCollector now builds PodMetricsList from nodemon's
stats/summary data (usageNanoCores, workingSetBytes) instead of
querying the Kubernetes Metrics API. NodeCollector gracefully handles
metrics-server unavailability when nodemon is enabled.

This eliminates the metrics-server dependency when nodemonMetrics is on.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When ENABLE_NODEMON_METRICS=true, the entrypoint script skips the
metrics-server check and installation. CPU/memory now comes from
nodemon's stats/summary data, not the Kubernetes Metrics API.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The ConfigMap is mounted as files at /etc/zxporter/config/, not as
env vars. The entrypoint now reads from the file mount to determine
whether to skip metrics-server installation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… flag

The nodemon client was only initialized when DisableGPUMetrics=false.
When GPU was disabled but nodemon metrics enabled, the client stayed
nil and the collector fell back to metrics-server.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When metrics-server is unavailable and useNodemon=true, the node
metrics list was empty causing zero node_resource batches to be sent.
Now builds the list from the node informer so the iteration loop runs
and node resources are produced with network/disk data from nodemon.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When useNodemon=true, the node collector now fetches all container
metrics from nodemon and aggregates CPU (nanocores) and memory
(working set bytes) per node. This populates the node Usage field
so cluster overview dashboards show non-zero values.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…hen nodemon enabled

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…source

Remove the useNodemon/EnableNodemonMetrics feature flag and all Prometheus
query code paths from container, node, and PVC collectors. Nodemon is now
always used for network, disk I/O, CPU throttle, GPU, and PVC filesystem
metrics. This eliminates ~2600 lines of dual-path branching, Prometheus
client initialization, and fallback logic.

Key changes:
- Remove PrometheusURL, QueryTimeout, DisableNetworkIOMetrics, and
  EnableNodemonMetrics from all collector configs, PolicyConfig, and
  the CollectionPolicy CRD Policies struct
- Remove prometheusAPI field and all Prometheus client init from collectors
- Remove Prometheus query methods (collectPodNetworkMetrics,
  collectContainerIOMetrics, collectContainerCPUThrottleMetrics,
  collectContainerGPUMetrics, collectNodeNetworkIOMetrics,
  collectNodeGPUMetrics, getFilesystemUsageFromPrometheus)
- Rename nodemon methods to be the primary methods
- Remove waitForPrometheusAvailability from controller
- Remove HistoricalMetricsCollector (Prometheus-only); move
  HistoricalWorkloadQuery type to interface.go
- Delete Prometheus pinning tests (*_prom_test.go)
- Simplify entrypoint.sh to just start the application
- Historical percentile cache (DAKR-backed) is now always initialized
  when DakrClient is available

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Parthiba-Hazra and others added 6 commits May 1, 2026 19:47
HasExporters() discovers pods dynamically with a 30s TTL cache.
At startup, pods may not be discovered yet, causing IsAvailable()
to return false and the collector to be skipped entirely. Now just
checks that the nodemon client is initialized — the collection loop
gracefully handles empty nodemon responses.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…fore Start

The controller calls IsAvailable() before Start(). Nodemon client was
initialized in Start(), so IsAvailable() always returned false (nil
client). Move init to constructor for all three collectors.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
NetworkRxBytes/TxBytes from kubelet stats/summary are cumulative totals.
The MPA proto field NetworkReceiveBytesPerSec expects bytes/sec. Added
RateCalculator to convert cumulative counters to per-second rates,
matching what the old Prometheus rate(bytes_total[5m]) produced.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
NodeMetricsResponse was a placeholder with all zeros. Now aggregates
cAdvisor per-container rates (disk I/O, network packets/errors) into
node-level totals, and computes node network byte rates from
stats/summary cumulative counters using RateCalculator.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove all Prometheus, node-exporter, and metrics-server references from:
- Makefile (variables, docker-build metrics-server generation,
  generate-monitoring-manifests target, build-installer prom sections)
- Dockerfile (metrics-server.yaml COPY)
- config/manager/env_configmap.yaml (PROMETHEUS_URL, DISABLE_NETWORK_IO_METRICS)
- config/prometheus/ directory (deleted entirely)
- dist/ (prometheus.yaml, node-exporter.yaml, metrics-server.yaml deleted)
- Helm chart configmap template and values.yaml (prometheusUrl removed)

Add nodemon as subchart dependency of zxporter helm chart and generate
nodemon manifest in build-installer target.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ConfigMaps missing {{ .Release.Namespace }} caused them to be created
in the default namespace when applied via kubectl apply -f, since
kubectl doesn't override namespace on resources without one.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@Parthiba-Hazra Parthiba-Hazra changed the title Albida prometheus Biday prometheus May 5, 2026
Comment on lines +325 to +327
// GPU metrics are embedded in the unified container metrics when available.
// No separate GPU collection needed — data flows through nodemonContainerMetricsCache.
gpuMetrics = make(map[string]interface{})
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚨 Bug: Container GPU metrics always empty — data never extracted from cache

After removing the Prometheus GPU query path, gpuMetrics is unconditionally set to an empty map (make(map[string]interface{})) at line 327. The comment claims "data flows through nodemonContainerMetricsCache" but there is no code that reads the GPU fields (GPUUtilization, GPUMemoryUsedMiB, etc.) from UnifiedContainerMetric and populates the gpuMetrics map before passing it to processContainerMetrics. This means container-level GPU metrics (utilization, memory, power, temperature) are silently lost for all GPU workloads — a regression from the previous Prometheus+nodemon path.

Suggested fix:

// GPU metrics from nodemon unified cache
if c.nodemonContainerMetricsCache != nil {
    podKey := pod.Namespace + "/" + pod.Name
    if cms, ok := c.nodemonContainerMetricsCache[podKey]; ok {
        for _, cm := range cms {
            if cm.Container == containerMetrics.Name && cm.GPUUtilization > 0 {
                gpuMetrics["gpu_utilization"] = cm.GPUUtilization
                gpuMetrics["gpu_memory_used_mib"] = cm.GPUMemoryUsedMiB
                gpuMetrics["gpu_memory_free_mib"] = cm.GPUMemoryFreeMiB
                gpuMetrics["gpu_power_watts"] = cm.GPUPowerWatts
                gpuMetrics["gpu_temperature"] = cm.GPUTemperature
                break
            }
        }
    }
}

Was this helpful? React with 👍 / 👎 | Reply gitar fix to apply this suggestion

} else {
// Fallback: parse key "namespace/name/kind"
var ns, name, kind string
if _, err := fmt.Sscanf(key, "%s/%s/%s", &ns, &name, &kind); err == nil {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Bug: fmt.Sscanf with "%s/%s/%s" cannot parse slash-separated keys

In historical_percentile_cache.go:107, fmt.Sscanf(key, "%s/%s/%s", &ns, &name, &kind) is used to parse keys of the form namespace/name/kind. However, %s in Go's fmt.Sscanf consumes until whitespace, not until /. The first %s will consume the entire string (e.g., "default/my-deploy/Deployment") leaving nothing for the other two verbs. This means the fallback key parsing always fails silently, and those workloads won't be refreshed.

Suggested fix:

// Replace fmt.Sscanf with strings.Split
parts := strings.SplitN(key, "/", 3)
if len(parts) == 3 {
    workloads = append(workloads, HistoricalWorkloadQuery{
        Namespace:    parts[0],
        WorkloadName: parts[1],
        WorkloadKind: parts[2],
    })
}

Was this helpful? React with 👍 / 👎 | Reply gitar fix to apply this suggestion

if config != nil && r.DakrClient != nil {
fetcher := r.DakrClient.NewPercentileFetcher()
cache := collector.NewHistoricalPercentileCache(r.Log, fetcher, "" /* clusterID resolved from auth token */, r.HealthManager)
go cache.Start(context.Background())
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Bug: HistoricalPercentileCache goroutine started with uncancellable context

In setupMpaServer (line 1989), go cache.Start(context.Background()) launches a goroutine with a context that can never be cancelled. If the controller is stopped or the MPA server needs to be torn down, this goroutine (and its 15-minute ticker) will leak and continue making DAKR API calls indefinitely. The guard if r.MpaServer != nil { return nil } prevents duplicates only if setupMpaServer isn't called after a full teardown.

Suggested fix:

// Store a cancel func on the reconciler and use a derived context:
cacheCtx, cacheCancel := context.WithCancel(context.Background())
r.historicalCacheCancel = cacheCancel  // new field on reconciler
go cache.Start(cacheCtx)

// Then in teardown/Stop, call r.historicalCacheCancel()

Was this helpful? React with 👍 / 👎 | Reply gitar fix to apply this suggestion

@gitar-bot
Copy link
Copy Markdown

gitar-bot Bot commented May 5, 2026

Code Review 🚫 Blocked 0 resolved / 3 findings

Decommissions the Prometheus build dependency in favor of a unified nodemon-based collector, but misses container GPU metrics due to an unpopulated map. Fails on key parsing in the historical percentile cache and lacks proper cancellation for the background cache routine.

🚨 Bug: Container GPU metrics always empty — data never extracted from cache

📄 internal/collector/container_resource_collector.go:325-327

After removing the Prometheus GPU query path, gpuMetrics is unconditionally set to an empty map (make(map[string]interface{})) at line 327. The comment claims "data flows through nodemonContainerMetricsCache" but there is no code that reads the GPU fields (GPUUtilization, GPUMemoryUsedMiB, etc.) from UnifiedContainerMetric and populates the gpuMetrics map before passing it to processContainerMetrics. This means container-level GPU metrics (utilization, memory, power, temperature) are silently lost for all GPU workloads — a regression from the previous Prometheus+nodemon path.

Suggested fix
// GPU metrics from nodemon unified cache
if c.nodemonContainerMetricsCache != nil {
    podKey := pod.Namespace + "/" + pod.Name
    if cms, ok := c.nodemonContainerMetricsCache[podKey]; ok {
        for _, cm := range cms {
            if cm.Container == containerMetrics.Name && cm.GPUUtilization > 0 {
                gpuMetrics["gpu_utilization"] = cm.GPUUtilization
                gpuMetrics["gpu_memory_used_mib"] = cm.GPUMemoryUsedMiB
                gpuMetrics["gpu_memory_free_mib"] = cm.GPUMemoryFreeMiB
                gpuMetrics["gpu_power_watts"] = cm.GPUPowerWatts
                gpuMetrics["gpu_temperature"] = cm.GPUTemperature
                break
            }
        }
    }
}
⚠️ Bug: fmt.Sscanf with "%s/%s/%s" cannot parse slash-separated keys

📄 internal/collector/historical_percentile_cache.go:107

In historical_percentile_cache.go:107, fmt.Sscanf(key, "%s/%s/%s", &ns, &name, &kind) is used to parse keys of the form namespace/name/kind. However, %s in Go's fmt.Sscanf consumes until whitespace, not until /. The first %s will consume the entire string (e.g., "default/my-deploy/Deployment") leaving nothing for the other two verbs. This means the fallback key parsing always fails silently, and those workloads won't be refreshed.

Suggested fix
// Replace fmt.Sscanf with strings.Split
parts := strings.SplitN(key, "/", 3)
if len(parts) == 3 {
    workloads = append(workloads, HistoricalWorkloadQuery{
        Namespace:    parts[0],
        WorkloadName: parts[1],
        WorkloadKind: parts[2],
    })
}
⚠️ Bug: HistoricalPercentileCache goroutine started with uncancellable context

📄 internal/controller/collectionpolicy_controller.go:1989

In setupMpaServer (line 1989), go cache.Start(context.Background()) launches a goroutine with a context that can never be cancelled. If the controller is stopped or the MPA server needs to be torn down, this goroutine (and its 15-minute ticker) will leak and continue making DAKR API calls indefinitely. The guard if r.MpaServer != nil { return nil } prevents duplicates only if setupMpaServer isn't called after a full teardown.

Suggested fix
// Store a cancel func on the reconciler and use a derived context:
cacheCtx, cacheCancel := context.WithCancel(context.Background())
r.historicalCacheCancel = cacheCancel  // new field on reconciler
go cache.Start(cacheCtx)

// Then in teardown/Stop, call r.historicalCacheCancel()
🤖 Prompt for agents
Code Review: Decommissions the Prometheus build dependency in favor of a unified nodemon-based collector, but misses container GPU metrics due to an unpopulated map. Fails on key parsing in the historical percentile cache and lacks proper cancellation for the background cache routine.

1. 🚨 Bug: Container GPU metrics always empty — data never extracted from cache
   Files: internal/collector/container_resource_collector.go:325-327

   After removing the Prometheus GPU query path, `gpuMetrics` is unconditionally set to an empty map (`make(map[string]interface{})`) at line 327. The comment claims "data flows through nodemonContainerMetricsCache" but there is no code that reads the GPU fields (GPUUtilization, GPUMemoryUsedMiB, etc.) from `UnifiedContainerMetric` and populates the `gpuMetrics` map before passing it to `processContainerMetrics`. This means container-level GPU metrics (utilization, memory, power, temperature) are silently lost for all GPU workloads — a regression from the previous Prometheus+nodemon path.

   Suggested fix:
   // GPU metrics from nodemon unified cache
   if c.nodemonContainerMetricsCache != nil {
       podKey := pod.Namespace + "/" + pod.Name
       if cms, ok := c.nodemonContainerMetricsCache[podKey]; ok {
           for _, cm := range cms {
               if cm.Container == containerMetrics.Name && cm.GPUUtilization > 0 {
                   gpuMetrics["gpu_utilization"] = cm.GPUUtilization
                   gpuMetrics["gpu_memory_used_mib"] = cm.GPUMemoryUsedMiB
                   gpuMetrics["gpu_memory_free_mib"] = cm.GPUMemoryFreeMiB
                   gpuMetrics["gpu_power_watts"] = cm.GPUPowerWatts
                   gpuMetrics["gpu_temperature"] = cm.GPUTemperature
                   break
               }
           }
       }
   }

2. ⚠️ Bug: fmt.Sscanf with "%s/%s/%s" cannot parse slash-separated keys
   Files: internal/collector/historical_percentile_cache.go:107

   In `historical_percentile_cache.go:107`, `fmt.Sscanf(key, "%s/%s/%s", &ns, &name, &kind)` is used to parse keys of the form `namespace/name/kind`. However, `%s` in Go's `fmt.Sscanf` consumes until whitespace, not until `/`. The first `%s` will consume the entire string (e.g., "default/my-deploy/Deployment") leaving nothing for the other two verbs. This means the fallback key parsing always fails silently, and those workloads won't be refreshed.

   Suggested fix:
   // Replace fmt.Sscanf with strings.Split
   parts := strings.SplitN(key, "/", 3)
   if len(parts) == 3 {
       workloads = append(workloads, HistoricalWorkloadQuery{
           Namespace:    parts[0],
           WorkloadName: parts[1],
           WorkloadKind: parts[2],
       })
   }

3. ⚠️ Bug: HistoricalPercentileCache goroutine started with uncancellable context
   Files: internal/controller/collectionpolicy_controller.go:1989

   In `setupMpaServer` (line 1989), `go cache.Start(context.Background())` launches a goroutine with a context that can never be cancelled. If the controller is stopped or the MPA server needs to be torn down, this goroutine (and its 15-minute ticker) will leak and continue making DAKR API calls indefinitely. The guard `if r.MpaServer != nil { return nil }` prevents duplicates only if setupMpaServer isn't called after a full teardown.

   Suggested fix:
   // Store a cancel func on the reconciler and use a derived context:
   cacheCtx, cacheCancel := context.WithCancel(context.Background())
   r.historicalCacheCancel = cacheCancel  // new field on reconciler
   go cache.Start(cacheCtx)
   
   // Then in teardown/Stop, call r.historicalCacheCancel()

Was this helpful? React with 👍 / 👎 | Gitar

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant