Biday prometheus#375
Conversation
Covers data source replacement map (kubelet stats/summary + cAdvisor), nodemon DaemonSet extension, historical percentile caching from DAKR ClickHouse, and phased migration strategy with feature flags. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tion logic Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…sion Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implement StatsPoller to fetch and parse the kubelet /stats/summary endpoint, providing CPU, memory, network, and PVC volume metrics for all pods on a node without a Prometheus dependency. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
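The shape of that work can be sketched with the standard library alone. This is an illustrative subset of the kubelet `/stats/summary` JSON schema, not the PR's actual StatsPoller types; the field names follow the Summary API, but the `summary` struct and `parseSummary` helper are assumptions for demonstration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// summary mirrors a minimal slice of the kubelet /stats/summary response:
// per-pod CPU (instantaneous nanocores) and memory (working set bytes).
type summary struct {
	Pods []struct {
		PodRef struct {
			Name      string `json:"name"`
			Namespace string `json:"namespace"`
		} `json:"podRef"`
		CPU struct {
			UsageNanoCores uint64 `json:"usageNanoCores"`
		} `json:"cpu"`
		Memory struct {
			WorkingSetBytes uint64 `json:"workingSetBytes"`
		} `json:"memory"`
	} `json:"pods"`
}

// parseSummary decodes a /stats/summary response body.
func parseSummary(body []byte) (*summary, error) {
	var s summary
	if err := json.Unmarshal(body, &s); err != nil {
		return nil, err
	}
	return &s, nil
}

func main() {
	body := []byte(`{"pods":[{"podRef":{"name":"web-0","namespace":"default"},"cpu":{"usageNanoCores":250000000},"memory":{"workingSetBytes":104857600}}]}`)
	s, err := parseSummary(body)
	if err != nil {
		panic(err)
	}
	p := s.Pods[0]
	fmt.Printf("%s/%s cpu=%dn mem=%dB\n", p.PodRef.Namespace, p.PodRef.Name,
		p.CPU.UsageNanoCores, p.Memory.WorkingSetBytes)
}
```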
…e, disk I/O, network Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…/metrics endpoints Introduce three new HTTP handlers backed by a UnifiedQuerier interface that serves merged container metrics (CPU, memory, network, disk, throttle, GPU), node-level aggregated rates, and PVC storage metrics as JSON. All endpoints support GET-only with query-param filtering; tests cover JSON encoding, per-field values, filtering, 404 (no node data), and 405 (wrong method). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ic fetchers Add UnifiedContainerMetric, UnifiedNodeMetric, and UnifiedPVCMetric types mirroring nodemon's new unified endpoints, plus fetch*/FetchAll* methods for /v2/container/metrics, /node/metrics, and /pvc/metrics with httptest coverage. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds HistoricalPercentileProvider interface to internal/collector/interface.go and updates MpaServer to depend on the interface rather than the concrete *HistoricalMetricsCollector, enabling future substitution with a DAKR-backed cache. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Implements HistoricalPercentileProvider with an in-memory cache that is populated from a PercentileFetcher (DAKR control plane) on a 15-minute ticker, so HPA recommendations no longer require a local Prometheus. Stale data is preserved on fetch errors and health is reported as Degraded. All five TDD test cases pass. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…e flag When EnableNodemonMetrics is true, the collector pre-fetches all container metrics from nodemon once per cycle and uses them for network, disk I/O, and CPU throttle metrics instead of querying Prometheus. Falls back to Prometheus if the nodemon fetch fails. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…r setup Pass PolicyConfig to setupMpaServer so EnableNodemonMetrics is accessible; establishes the historicalProvider wiring point for when the DAKR fetcher is available. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…hind feature flag
- Add EnableNodemonMetrics field to NodeCollectorConfig and PersistentVolumeClaimMetricsCollectorConfig
- Add useNodemon bool and nodemonClient fields to both collector structs
- Implement collectNodeNetworkIOMetricsFromNodemon mapping UnifiedNodeMetric fields to Prometheus key names
- Extend UnifiedNodeMetric with packet/error/drop rate fields to support full network IO mapping
- Implement getFilesystemUsageFromNodemon using FetchAllPVCMetrics with namespace+name matching
- Add nodemon branch guard at top of collectNodeNetworkIOMetrics and getFilesystemUsageFromPrometheus
- Initialize nodemon client in PVCMetricsCollector.Start when useNodemon is set; skip Prometheus init
- Wire EnableNodemonMetrics flag through all 4 NodeCollector and 3 PVCMetricsCollector creation sites in the controller
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds TestIntegration_FullMetricsFlow to exercise the combined StatsPoller + CAdvisorScraper pipeline end-to-end using httptest mock servers — verifying CPU/memory/network from stats/summary and throttle, disk-I/O, and packet rates from cAdvisor across two scrape cycles. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…e percentile retrieval
Wire the DAKR K8SService.GetWorkloadContainerPercentiles RPC into zxporter as a collector.PercentileFetcher implementation, replacing the Prometheus dependency for historical workload metrics in the MPA server path.
- Add K8SServiceClient to RealDakrClient and initialize in NewDakrClient
- Create DakrPercentileFetcher with rate-limited concurrent RPC calls
- Map ContainerPercentileSummary (float64 cores/bytes) to ContainerHistoricalMetrics (int64 millicores/bytes)
- Add NewPercentileFetcher() to DakrClient interface and both implementations
- Wire fetcher into setupMpaServer when EnableNodemonMetrics is true
- Add comprehensive tests for mapping, error handling, and nil safety
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…Metrics.enabled=true When nodemonMetrics.enabled=true in values.yaml, the Helm chart no longer deploys Prometheus server, kube-state-metrics, or node-exporter. The nodemon DaemonSet collects metrics directly from kubelet, making these components unnecessary. PROMETHEUS_URL is omitted from the ConfigMap and ENABLE_NODEMON_METRICS is set to "true" so the application can detect the mode. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…and cAdvisor scraper Add UnifiedExporter that combines stats/summary, cAdvisor, and GPU data into a single collection loop. Register /v2/container/metrics, /node/metrics, /pvc/metrics on the HTTP mux. Uses K8s API server proxy for kubelet access with proper auth. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
NodeCollector and ContainerResourceCollector were still trying to connect to Prometheus on startup even with useNodemon=true. Gate the initPrometheusClient call behind !c.useNodemon. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…tch when nodemon enabled
- Add disableGPUMetrics value to Helm chart and wire into configmap
- Skip legacy /container/metrics GPU fetch when useNodemon=true (GPU data comes from unified /v2/container/metrics instead)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ContainerResourceCollector now builds PodMetricsList from nodemon's stats/summary data (usageNanoCores, workingSetBytes) instead of querying the Kubernetes Metrics API. NodeCollector gracefully handles metrics-server unavailability when nodemon is enabled. This eliminates the metrics-server dependency when nodemonMetrics is on. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
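Building pod metrics from stats/summary instead of the Metrics API mainly requires unit conversion: `usageNanoCores` is an instantaneous nanocore gauge, while the Metrics API reports CPU in millicores (working-set bytes pass through unchanged). The helper below is an illustrative sketch of that conversion, not code from the PR.

```go
package main

import "fmt"

// toMilliCPU converts a kubelet usageNanoCores gauge to millicores,
// the unit the Kubernetes Metrics API uses for CPU usage.
func toMilliCPU(usageNanoCores uint64) int64 {
	return int64(usageNanoCores / 1_000_000) // 1 millicore = 1e6 nanocores
}

func main() {
	fmt.Println(toMilliCPU(250_000_000)) // a quarter core = 250m
	fmt.Println(toMilliCPU(1_000_000))   // 1m
}
```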
When ENABLE_NODEMON_METRICS=true, the entrypoint script skips the metrics-server check and installation. CPU/memory now comes from nodemon's stats/summary data, not the Kubernetes Metrics API. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The ConfigMap is mounted as files at /etc/zxporter/config/, not as env vars. The entrypoint now reads from the file mount to determine whether to skip metrics-server installation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… flag The nodemon client was only initialized when DisableGPUMetrics=false. When GPU was disabled but nodemon metrics enabled, the client stayed nil and the collector fell back to metrics-server. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When metrics-server is unavailable and useNodemon=true, the node metrics list was empty causing zero node_resource batches to be sent. Now builds the list from the node informer so the iteration loop runs and node resources are produced with network/disk data from nodemon. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When useNodemon=true, the node collector now fetches all container metrics from nodemon and aggregates CPU (nanocores) and memory (working set bytes) per node. This populates the node Usage field so cluster overview dashboards show non-zero values. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…hen nodemon enabled Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…source
Remove the useNodemon/EnableNodemonMetrics feature flag and all Prometheus query code paths from container, node, and PVC collectors. Nodemon is now always used for network, disk I/O, CPU throttle, GPU, and PVC filesystem metrics. This eliminates ~2600 lines of dual-path branching, Prometheus client initialization, and fallback logic.
Key changes:
- Remove PrometheusURL, QueryTimeout, DisableNetworkIOMetrics, and EnableNodemonMetrics from all collector configs, PolicyConfig, and the CollectionPolicy CRD Policies struct
- Remove prometheusAPI field and all Prometheus client init from collectors
- Remove Prometheus query methods (collectPodNetworkMetrics, collectContainerIOMetrics, collectContainerCPUThrottleMetrics, collectContainerGPUMetrics, collectNodeNetworkIOMetrics, collectNodeGPUMetrics, getFilesystemUsageFromPrometheus)
- Rename nodemon methods to be the primary methods
- Remove waitForPrometheusAvailability from controller
- Remove HistoricalMetricsCollector (Prometheus-only); move HistoricalWorkloadQuery type to interface.go
- Delete Prometheus pinning tests (*_prom_test.go)
- Simplify entrypoint.sh to just start the application
- Historical percentile cache (DAKR-backed) is now always initialized when DakrClient is available
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
HasExporters() discovers pods dynamically with a 30s TTL cache. At startup, pods may not be discovered yet, causing IsAvailable() to return false and the collector to be skipped entirely. Now just checks that the nodemon client is initialized — the collection loop gracefully handles empty nodemon responses. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…fore Start The controller calls IsAvailable() before Start(). Nodemon client was initialized in Start(), so IsAvailable() always returned false (nil client). Move init to constructor for all three collectors. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
NetworkRxBytes/TxBytes from kubelet stats/summary are cumulative totals. The MPA proto field NetworkReceiveBytesPerSec expects bytes/sec. Added RateCalculator to convert cumulative counters to per-second rates, matching what the old Prometheus rate(bytes_total[5m]) produced. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
NodeMetricsResponse was a placeholder with all zeros. Now aggregates cAdvisor per-container rates (disk I/O, network packets/errors) into node-level totals, and computes node network byte rates from stats/summary cumulative counters using RateCalculator. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove all Prometheus, node-exporter, and metrics-server references from:
- Makefile (variables, docker-build metrics-server generation, generate-monitoring-manifests target, build-installer prom sections)
- Dockerfile (metrics-server.yaml COPY)
- config/manager/env_configmap.yaml (PROMETHEUS_URL, DISABLE_NETWORK_IO_METRICS)
- config/prometheus/ directory (deleted entirely)
- dist/ (prometheus.yaml, node-exporter.yaml, metrics-server.yaml deleted)
- Helm chart configmap template and values.yaml (prometheusUrl removed)
Add nodemon as subchart dependency of zxporter helm chart and generate nodemon manifest in build-installer target.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ConfigMaps missing {{ .Release.Namespace }} caused them to be created
in the default namespace when applied via kubectl apply -f, since
kubectl doesn't override namespace on resources without one.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
// GPU metrics are embedded in the unified container metrics when available.
// No separate GPU collection needed — data flows through nodemonContainerMetricsCache.
gpuMetrics = make(map[string]interface{})
🚨 Bug: Container GPU metrics always empty — data never extracted from cache
After removing the Prometheus GPU query path, gpuMetrics is unconditionally set to an empty map (make(map[string]interface{})) at line 327. The comment claims "data flows through nodemonContainerMetricsCache" but there is no code that reads the GPU fields (GPUUtilization, GPUMemoryUsedMiB, etc.) from UnifiedContainerMetric and populates the gpuMetrics map before passing it to processContainerMetrics. This means container-level GPU metrics (utilization, memory, power, temperature) are silently lost for all GPU workloads — a regression from the previous Prometheus+nodemon path.
Suggested fix:
// GPU metrics from nodemon unified cache
if c.nodemonContainerMetricsCache != nil {
podKey := pod.Namespace + "/" + pod.Name
if cms, ok := c.nodemonContainerMetricsCache[podKey]; ok {
for _, cm := range cms {
if cm.Container == containerMetrics.Name && cm.GPUUtilization > 0 {
gpuMetrics["gpu_utilization"] = cm.GPUUtilization
gpuMetrics["gpu_memory_used_mib"] = cm.GPUMemoryUsedMiB
gpuMetrics["gpu_memory_free_mib"] = cm.GPUMemoryFreeMiB
gpuMetrics["gpu_power_watts"] = cm.GPUPowerWatts
gpuMetrics["gpu_temperature"] = cm.GPUTemperature
break
}
}
}
}
} else {
	// Fallback: parse key "namespace/name/kind"
	var ns, name, kind string
	if _, err := fmt.Sscanf(key, "%s/%s/%s", &ns, &name, &kind); err == nil {
⚠️ Bug: fmt.Sscanf with "%s/%s/%s" cannot parse slash-separated keys
In historical_percentile_cache.go:107, fmt.Sscanf(key, "%s/%s/%s", &ns, &name, &kind) is used to parse keys of the form namespace/name/kind. However, %s in Go's fmt.Sscanf consumes until whitespace, not until /. The first %s will consume the entire string (e.g., "default/my-deploy/Deployment") leaving nothing for the other two verbs. This means the fallback key parsing always fails silently, and those workloads won't be refreshed.
Suggested fix:
// Replace fmt.Sscanf with strings.Split
parts := strings.SplitN(key, "/", 3)
if len(parts) == 3 {
workloads = append(workloads, HistoricalWorkloadQuery{
Namespace: parts[0],
WorkloadName: parts[1],
WorkloadKind: parts[2],
})
}
if config != nil && r.DakrClient != nil {
	fetcher := r.DakrClient.NewPercentileFetcher()
	cache := collector.NewHistoricalPercentileCache(r.Log, fetcher, "" /* clusterID resolved from auth token */, r.HealthManager)
	go cache.Start(context.Background())
⚠️ Bug: HistoricalPercentileCache goroutine started with uncancellable context
In setupMpaServer (line 1989), go cache.Start(context.Background()) launches a goroutine with a context that can never be cancelled. If the controller is stopped or the MPA server needs to be torn down, this goroutine (and its 15-minute ticker) will leak and continue making DAKR API calls indefinitely. The guard if r.MpaServer != nil { return nil } prevents duplicates only if setupMpaServer isn't called after a full teardown.
Suggested fix:
// Store a cancel func on the reconciler and use a derived context:
cacheCtx, cacheCancel := context.WithCancel(context.Background())
r.historicalCacheCancel = cacheCancel // new field on reconciler
go cache.Start(cacheCtx)
// Then in teardown/Stop, call r.historicalCacheCancel()
Code Review 🚫 Blocked: 0 resolved / 3 findings
Decommissions the Prometheus build dependency in favor of a unified nodemon-based collector, but misses container GPU metrics due to an unpopulated map (internal/collector/container_resource_collector.go:325-327), fails on key parsing in the historical percentile cache, and lacks proper cancellation for the background cache routine.
[Title]
📚 Description of Changes
Provide an overview of your changes and why they’re needed. Link to any related issues (e.g., "Fixes #123"). If your PR fixes a bug, resolves a feature request, or updates documentation, please explain how.
What Changed:
(Describe the modifications, additions, or removals.)
Why This Change:
(Explain the problem this PR addresses or the improvement it provides.)
Affected Components:
(Which component does this change affect? - put x for all components)
- [ ] Compose
- [ ] K8s
- [ ] Other (please specify)
❓ Motivation and Context
Why is this change required? What problem does it solve?
Context:
(Provide background information or link to related discussions/issues.)
Relevant Tasks/Issues:
(e.g., Fixes: #GitHub Issue)
🔍 Types of Changes
Indicate which type of changes your code introduces (check all that apply):
🔬 QA / Verification Steps
Describe the steps a reviewer should take to verify your changes:
(e.g., "Run make test to verify all tests pass.")
(e.g., "Deploy locally with make create-kind && make deploy.")
✅ Global Checklist
Please check all boxes that apply:
Summary by Gitar
- Removes the Prometheus server, node-exporter, and kube-state-metrics components.
- Adds nodemon to aggregate node, container, and PVC metrics via stats/summary and /metrics/cadvisor endpoints.
- Replaces HistoricalMetricsCollector with HistoricalPercentileCache, fetching pre-computed workload percentiles directly from the DAKR control plane.
- Adds CAdvisorScraper for rate computation and PercentileFetcher for DAKR-backed historical metrics.

This will update automatically on new commits.