Biday prometheus#375
Conversation
Covers data source replacement map (kubelet stats/summary + cAdvisor), nodemon DaemonSet extension, historical percentile caching from DAKR ClickHouse, and phased migration strategy with feature flags. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tion logic Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…sion Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implement StatsPoller to fetch and parse the kubelet /stats/summary endpoint, providing CPU, memory, network, and PVC volume metrics for all pods on a node without a Prometheus dependency. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
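The shape of that work can be sketched with the standard library alone. This is an illustrative subset of the kubelet `/stats/summary` JSON schema, not the PR's actual StatsPoller types; the field names follow the Summary API, but the `summary` struct and `parseSummary` helper are assumptions for demonstration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// summary mirrors a minimal slice of the kubelet /stats/summary response:
// per-pod CPU (instantaneous nanocores) and memory (working set bytes).
type summary struct {
	Pods []struct {
		PodRef struct {
			Name      string `json:"name"`
			Namespace string `json:"namespace"`
		} `json:"podRef"`
		CPU struct {
			UsageNanoCores uint64 `json:"usageNanoCores"`
		} `json:"cpu"`
		Memory struct {
			WorkingSetBytes uint64 `json:"workingSetBytes"`
		} `json:"memory"`
	} `json:"pods"`
}

// parseSummary decodes a /stats/summary response body.
func parseSummary(body []byte) (*summary, error) {
	var s summary
	if err := json.Unmarshal(body, &s); err != nil {
		return nil, err
	}
	return &s, nil
}

func main() {
	body := []byte(`{"pods":[{"podRef":{"name":"web-0","namespace":"default"},"cpu":{"usageNanoCores":250000000},"memory":{"workingSetBytes":104857600}}]}`)
	s, err := parseSummary(body)
	if err != nil {
		panic(err)
	}
	p := s.Pods[0]
	fmt.Printf("%s/%s cpu=%dn mem=%dB\n", p.PodRef.Namespace, p.PodRef.Name,
		p.CPU.UsageNanoCores, p.Memory.WorkingSetBytes)
}
```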
…e, disk I/O, network Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…/metrics endpoints Introduce three new HTTP handlers backed by a UnifiedQuerier interface that serves merged container metrics (CPU, memory, network, disk, throttle, GPU), node-level aggregated rates, and PVC storage metrics as JSON. All endpoints support GET-only with query-param filtering; tests cover JSON encoding, per-field values, filtering, 404 (no node data), and 405 (wrong method). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ic fetchers Add UnifiedContainerMetric, UnifiedNodeMetric, and UnifiedPVCMetric types mirroring nodemon's new unified endpoints, plus fetch*/FetchAll* methods for /v2/container/metrics, /node/metrics, and /pvc/metrics with httptest coverage. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds HistoricalPercentileProvider interface to internal/collector/interface.go and updates MpaServer to depend on the interface rather than the concrete *HistoricalMetricsCollector, enabling future substitution with a DAKR-backed cache. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implements HistoricalPercentileProvider with an in-memory cache that is populated from a PercentileFetcher (DAKR control plane) on a 15-minute ticker, so HPA recommendations no longer require a local Prometheus. Stale data is preserved on fetch errors and health is reported as Degraded. All five TDD test cases pass. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…e flag When EnableNodemonMetrics is true, the collector pre-fetches all container metrics from nodemon once per cycle and uses them for network, disk I/O, and CPU throttle metrics instead of querying Prometheus. Falls back to Prometheus if the nodemon fetch fails. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…r setup Pass PolicyConfig to setupMpaServer so EnableNodemonMetrics is accessible; establishes the historicalProvider wiring point for when the DAKR fetcher is available. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…hind feature flag
- Add EnableNodemonMetrics field to NodeCollectorConfig and PersistentVolumeClaimMetricsCollectorConfig
- Add useNodemon bool and nodemonClient fields to both collector structs
- Implement collectNodeNetworkIOMetricsFromNodemon mapping UnifiedNodeMetric fields to Prometheus key names
- Extend UnifiedNodeMetric with packet/error/drop rate fields to support full network IO mapping
- Implement getFilesystemUsageFromNodemon using FetchAllPVCMetrics with namespace+name matching
- Add nodemon branch guard at top of collectNodeNetworkIOMetrics and getFilesystemUsageFromPrometheus
- Initialize nodemon client in PVCMetricsCollector.Start when useNodemon is set; skip Prometheus init
- Wire EnableNodemonMetrics flag through all 4 NodeCollector and 3 PVCMetricsCollector creation sites in the controller
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds TestIntegration_FullMetricsFlow to exercise the combined StatsPoller + CAdvisorScraper pipeline end-to-end using httptest mock servers — verifying CPU/memory/network from stats/summary and throttle, disk-I/O, and packet rates from cAdvisor across two scrape cycles. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…e percentile retrieval
Wire the DAKR K8SService.GetWorkloadContainerPercentiles RPC into zxporter as a collector.PercentileFetcher implementation, replacing the Prometheus dependency for historical workload metrics in the MPA server path.
- Add K8SServiceClient to RealDakrClient and initialize in NewDakrClient
- Create DakrPercentileFetcher with rate-limited concurrent RPC calls
- Map ContainerPercentileSummary (float64 cores/bytes) to ContainerHistoricalMetrics (int64 millicores/bytes)
- Add NewPercentileFetcher() to DakrClient interface and both implementations
- Wire fetcher into setupMpaServer when EnableNodemonMetrics is true
- Add comprehensive tests for mapping, error handling, and nil safety
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…Metrics.enabled=true When nodemonMetrics.enabled=true in values.yaml, the Helm chart no longer deploys Prometheus server, kube-state-metrics, or node-exporter. The nodemon DaemonSet collects metrics directly from kubelet, making these components unnecessary. PROMETHEUS_URL is omitted from the ConfigMap and ENABLE_NODEMON_METRICS is set to "true" so the application can detect the mode. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…and cAdvisor scraper Add UnifiedExporter that combines stats/summary, cAdvisor, and GPU data into a single collection loop. Register /v2/container/metrics, /node/metrics, /pvc/metrics on the HTTP mux. Uses K8s API server proxy for kubelet access with proper auth. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
NodeCollector and ContainerResourceCollector were still trying to connect to Prometheus on startup even with useNodemon=true. Gate the initPrometheusClient call behind !c.useNodemon. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tch when nodemon enabled
- Add disableGPUMetrics value to Helm chart and wire into configmap
- Skip legacy /container/metrics GPU fetch when useNodemon=true (GPU data comes from unified /v2/container/metrics instead)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ContainerResourceCollector now builds PodMetricsList from nodemon's stats/summary data (usageNanoCores, workingSetBytes) instead of querying the Kubernetes Metrics API. NodeCollector gracefully handles metrics-server unavailability when nodemon is enabled. This eliminates the metrics-server dependency when nodemonMetrics is on. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
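Building pod metrics from stats/summary instead of the Metrics API mainly requires unit conversion: `usageNanoCores` is an instantaneous nanocore gauge, while the Metrics API reports CPU in millicores (working-set bytes pass through unchanged). The helper below is an illustrative sketch of that conversion, not code from the PR.

```go
package main

import "fmt"

// toMilliCPU converts a kubelet usageNanoCores gauge to millicores,
// the unit the Kubernetes Metrics API uses for CPU usage.
func toMilliCPU(usageNanoCores uint64) int64 {
	return int64(usageNanoCores / 1_000_000) // 1 millicore = 1e6 nanocores
}

func main() {
	fmt.Println(toMilliCPU(250_000_000)) // a quarter core = 250m
	fmt.Println(toMilliCPU(1_000_000))   // 1m
}
```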
When ENABLE_NODEMON_METRICS=true, the entrypoint script skips the metrics-server check and installation. CPU/memory now comes from nodemon's stats/summary data, not the Kubernetes Metrics API. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The ConfigMap is mounted as files at /etc/zxporter/config/, not as env vars. The entrypoint now reads from the file mount to determine whether to skip metrics-server installation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… flag The nodemon client was only initialized when DisableGPUMetrics=false. When GPU was disabled but nodemon metrics enabled, the client stayed nil and the collector fell back to metrics-server. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When metrics-server is unavailable and useNodemon=true, the node metrics list was empty causing zero node_resource batches to be sent. Now builds the list from the node informer so the iteration loop runs and node resources are produced with network/disk data from nodemon. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When useNodemon=true, the node collector now fetches all container metrics from nodemon and aggregates CPU (nanocores) and memory (working set bytes) per node. This populates the node Usage field so cluster overview dashboards show non-zero values. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…hen nodemon enabled Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…source
Remove the useNodemon/EnableNodemonMetrics feature flag and all Prometheus query code paths from container, node, and PVC collectors. Nodemon is now always used for network, disk I/O, CPU throttle, GPU, and PVC filesystem metrics. This eliminates ~2600 lines of dual-path branching, Prometheus client initialization, and fallback logic.
Key changes:
- Remove PrometheusURL, QueryTimeout, DisableNetworkIOMetrics, and EnableNodemonMetrics from all collector configs, PolicyConfig, and the CollectionPolicy CRD Policies struct
- Remove prometheusAPI field and all Prometheus client init from collectors
- Remove Prometheus query methods (collectPodNetworkMetrics, collectContainerIOMetrics, collectContainerCPUThrottleMetrics, collectContainerGPUMetrics, collectNodeNetworkIOMetrics, collectNodeGPUMetrics, getFilesystemUsageFromPrometheus)
- Rename nodemon methods to be the primary methods
- Remove waitForPrometheusAvailability from controller
- Remove HistoricalMetricsCollector (Prometheus-only); move HistoricalWorkloadQuery type to interface.go
- Delete Prometheus pinning tests (*_prom_test.go)
- Simplify entrypoint.sh to just start the application
- Historical percentile cache (DAKR-backed) is now always initialized when DakrClient is available
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
HasExporters() discovers pods dynamically with a 30s TTL cache. At startup, pods may not be discovered yet, causing IsAvailable() to return false and the collector to be skipped entirely. Now just checks that the nodemon client is initialized — the collection loop gracefully handles empty nodemon responses. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…fore Start The controller calls IsAvailable() before Start(). Nodemon client was initialized in Start(), so IsAvailable() always returned false (nil client). Move init to constructor for all three collectors. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
NetworkRxBytes/TxBytes from kubelet stats/summary are cumulative totals. The MPA proto field NetworkReceiveBytesPerSec expects bytes/sec. Added RateCalculator to convert cumulative counters to per-second rates, matching what the old Prometheus rate(bytes_total[5m]) produced. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
NodeMetricsResponse was a placeholder with all zeros. Now aggregates cAdvisor per-container rates (disk I/O, network packets/errors) into node-level totals, and computes node network byte rates from stats/summary cumulative counters using RateCalculator. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove all Prometheus, node-exporter, and metrics-server references from:
- Makefile (variables, docker-build metrics-server generation, generate-monitoring-manifests target, build-installer prom sections)
- Dockerfile (metrics-server.yaml COPY)
- config/manager/env_configmap.yaml (PROMETHEUS_URL, DISABLE_NETWORK_IO_METRICS)
- config/prometheus/ directory (deleted entirely)
- dist/ (prometheus.yaml, node-exporter.yaml, metrics-server.yaml deleted)
- Helm chart configmap template and values.yaml (prometheusUrl removed)
Add nodemon as subchart dependency of zxporter helm chart and generate nodemon manifest in build-installer target.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ConfigMaps missing {{ .Release.Namespace }} caused them to be created
in the default namespace when applied via kubectl apply -f, since
kubectl doesn't override namespace on resources without one.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
// GPU metrics are embedded in the unified container metrics when available.
// No separate GPU collection needed — data flows through nodemonContainerMetricsCache.
gpuMetrics = make(map[string]interface{})
🚨 Bug: Container GPU metrics always empty — data never extracted from cache
After removing the Prometheus GPU query path, gpuMetrics is unconditionally set to an empty map (make(map[string]interface{})) at line 327. The comment claims "data flows through nodemonContainerMetricsCache" but there is no code that reads the GPU fields (GPUUtilization, GPUMemoryUsedMiB, etc.) from UnifiedContainerMetric and populates the gpuMetrics map before passing it to processContainerMetrics. This means container-level GPU metrics (utilization, memory, power, temperature) are silently lost for all GPU workloads — a regression from the previous Prometheus+nodemon path.
Suggested fix:
// GPU metrics from nodemon unified cache
if c.nodemonContainerMetricsCache != nil {
podKey := pod.Namespace + "/" + pod.Name
if cms, ok := c.nodemonContainerMetricsCache[podKey]; ok {
for _, cm := range cms {
if cm.Container == containerMetrics.Name && cm.GPUUtilization > 0 {
gpuMetrics["gpu_utilization"] = cm.GPUUtilization
gpuMetrics["gpu_memory_used_mib"] = cm.GPUMemoryUsedMiB
gpuMetrics["gpu_memory_free_mib"] = cm.GPUMemoryFreeMiB
gpuMetrics["gpu_power_watts"] = cm.GPUPowerWatts
gpuMetrics["gpu_temperature"] = cm.GPUTemperature
break
}
}
}
}
} else {
	// Fallback: parse key "namespace/name/kind"
	var ns, name, kind string
	if _, err := fmt.Sscanf(key, "%s/%s/%s", &ns, &name, &kind); err == nil {
⚠️ Bug: fmt.Sscanf with "%s/%s/%s" cannot parse slash-separated keys
In historical_percentile_cache.go:107, fmt.Sscanf(key, "%s/%s/%s", &ns, &name, &kind) is used to parse keys of the form namespace/name/kind. However, %s in Go's fmt.Sscanf consumes until whitespace, not until /. The first %s will consume the entire string (e.g., "default/my-deploy/Deployment") leaving nothing for the other two verbs. This means the fallback key parsing always fails silently, and those workloads won't be refreshed.
Suggested fix:
// Replace fmt.Sscanf with strings.Split
parts := strings.SplitN(key, "/", 3)
if len(parts) == 3 {
workloads = append(workloads, HistoricalWorkloadQuery{
Namespace: parts[0],
WorkloadName: parts[1],
WorkloadKind: parts[2],
})
}
if config != nil && r.DakrClient != nil {
	fetcher := r.DakrClient.NewPercentileFetcher()
	cache := collector.NewHistoricalPercentileCache(r.Log, fetcher, "" /* clusterID resolved from auth token */, r.HealthManager)
	go cache.Start(context.Background())
⚠️ Bug: HistoricalPercentileCache goroutine started with uncancellable context
In setupMpaServer (line 1989), go cache.Start(context.Background()) launches a goroutine with a context that can never be cancelled. If the controller is stopped or the MPA server needs to be torn down, this goroutine (and its 15-minute ticker) will leak and continue making DAKR API calls indefinitely. The guard if r.MpaServer != nil { return nil } prevents duplicates only if setupMpaServer isn't called after a full teardown.
Suggested fix:
// Store a cancel func on the reconciler and use a derived context:
cacheCtx, cacheCancel := context.WithCancel(context.Background())
r.historicalCacheCancel = cacheCancel // new field on reconciler
go cache.Start(cacheCtx)
// Then in teardown/Stop, call r.historicalCacheCancel()
Code Review 🚫 Blocked: 0 resolved / 3 findings
Decommissions the Prometheus build dependency in favor of a unified nodemon-based collector, but misses container GPU metrics due to an unpopulated map (internal/collector/container_resource_collector.go:325-327), fails on key parsing in the historical percentile cache, and lacks proper cancellation for the background cache routine.
[Title]
📚 Description of Changes
Provide an overview of your changes and why they’re needed. Link to any related issues (e.g., "Fixes #123"). If your PR fixes a bug, resolves a feature request, or updates documentation, please explain how.
What Changed:
(Describe the modifications, additions, or removals.)
Why This Change:
(Explain the problem this PR addresses or the improvement it provides.)
Affected Components:
(Which component does this change affect? - put x for all components)
- [ ] Compose
- [ ] K8s
- [ ] Other (please specify)
❓ Motivation and Context
Why is this change required? What problem does it solve?
Context:
(Provide background information or link to related discussions/issues.)
Relevant Tasks/Issues:
(e.g., Fixes: #GitHub Issue)
🔍 Types of Changes
Indicate which type of changes your code introduces (check all that apply):
🔬 QA / Verification Steps
Describe the steps a reviewer should take to verify your changes:
(e.g., "Run make test to verify all tests pass.")
(e.g., "Deploy locally with make create-kind && make deploy.")
✅ Global Checklist
Please check all boxes that apply:
Summary by Gitar
- Removes the Prometheus server, node-exporter, and kube-state-metrics components.
- Adds nodemon to aggregate node, container, and PVC metrics via stats/summary and /metrics/cadvisor endpoints.
- Replaces HistoricalMetricsCollector with HistoricalPercentileCache, fetching pre-computed workload percentiles directly from the DAKR control plane.
- Adds CAdvisorScraper for rate computation and PercentileFetcher for DAKR-backed historical metrics.

This will update automatically on new commits.