
[biday prometheus] feat: remove Prometheus dependency, collect all metrics via nodemon. #376

Open
Parthiba-Hazra wants to merge 14 commits into main from ph/cut-the-prom-dep

Conversation


@Parthiba-Hazra Parthiba-Hazra commented May 5, 2026

…aemonSet

[Title]

📚 Description of Changes

Provide an overview of your changes and why they’re needed. Link to any related issues (e.g., "Fixes #123"). If your PR fixes a bug, resolves a feature request, or updates documentation, please explain how.

  • What Changed:
    (Describe the modifications, additions, or removals.)

  • Why This Change:
    (Explain the problem this PR addresses or the improvement it provides.)

  • Affected Components:
    (Which components does this change affect? Mark all that apply with an x.)

  • Compose

  • K8s

  • Other (please specify)

❓ Motivation and Context

Why is this change required? What problem does it solve?

  • Context:
    (Provide background information or link to related discussions/issues.)

  • Relevant Tasks/Issues:
    (e.g., Fixes: #GitHub Issue)

🔍 Types of Changes

Indicate which type of changes your code introduces (check all that apply):

  • BUGFIX: Non-breaking fix for an issue.
  • NEW FEATURE: Non-breaking addition of functionality.
  • BREAKING CHANGE: Fix or feature that causes existing functionality to not work as expected.
  • ENHANCEMENT: Improvement to existing functionality.
  • CHORE: Changes that do not affect production (e.g., documentation, build tooling, CI).

🔬 QA / Verification Steps

Describe the steps a reviewer should take to verify your changes:

  1. (Step one: e.g., "Run make test to verify all tests pass.")
  2. (Step two: e.g., "Deploy to a Kind cluster with make create-kind && make deploy.")
  3. (Additional steps as needed.)

✅ Global Checklist

Please check all boxes that apply:

  • I have read and followed the CONTRIBUTING guidelines.
  • My code follows the code style of this project.
  • I have updated the documentation as needed.
  • I have added tests that cover my changes.
  • All new and existing tests have passed locally.
  • I have run this code in a local environment to verify functionality.
  • I have considered the security implications of this change.

Summary by Gitar

  • Prometheus migration:
    • Added a prometheus-cleanup-job.yaml to both config/migration and Helm templates as a post-install/upgrade hook.
    • Implemented a Job using bitnami/kubectl to remove legacy Prometheus components while preserving current nodemon resources.
  • Build process updates:
    • Modified Makefile to include the Prometheus cleanup job in the generated DIST_INSTALL_BUNDLE.
  • GPU metrics collection:
    • Refactored internal/collector/container_resource_collector.go to source GPU requests and limits directly from the pod.Spec.
    • Integrated pod spec resources with existing nodemon usage metrics to provide comprehensive GPU dashboard reporting.

This will update automatically on new commits.

Comment thread internal/collector/container_resource_collector.go Outdated
Comment thread internal/collector/node_collector.go Outdated
Comment thread internal/collector/historical_percentile_cache.go Outdated
Comment thread internal/controller/collectionpolicy_controller.go Outdated
@Parthiba-Hazra Parthiba-Hazra force-pushed the ph/cut-the-prom-dep branch from 78b4b43 to d314993 on May 5, 2026 at 16:27
Comment thread internal/collector/node_collector.go Outdated
- Remove metrics-server and node-exporter install steps from k8s-compatibility-test
- Remove Prometheus debug step from k8s-compatibility-test
- Delete metrics-server-lifecycle-test workflow entirely (no longer needed)
- Build and push nodemon image alongside zxporter in CI
- Deploy nodemon subchart with zxporter in Helm path
- Pass IMG_NODEMON to make deploy for manifest path
- Set nodemonMetrics.enabled=true in CI Helm values
Comment on lines 824 to +573
@@ -1369,6 +562,15 @@ subjects:
namespace: devzero-system
---
apiVersion: v1
kind: Secret
metadata:
name: devzero-zxporter-token
namespace: devzero-system
stringData:
CLUSTER_TOKEN: "{{ .cluster_token }}"
type: Opaque
---
apiVersion: v1

@sandipanpanda sandipanpanda May 6, 2026


This might overwrite a Secret that is already present in the cluster during an update.

Comment thread dist/installer_updater.yaml Outdated
Comment on lines +824 to +17
name: '{{.zxporter_namespace}}'
name: devzero-system

This might break the namespace being passed based on which namespace zxporter is running in (devzero-zxporter or devzero-system).

GPU metrics from nodemon's unified endpoint were never extracted into
the gpuMetrics map, leaving it always empty. Now populates from
nodemonContainerMetricsCache when GPUUtilization > 0.
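A hedged sketch of the fix described above (type and field names are illustrative stand-ins, not the project's exact API): the map is populated from the cached entry only when GPU activity is reported, instead of being left empty unconditionally.

```go
package main

import "fmt"

// UnifiedContainerMetric is a hypothetical stand-in for a nodemon
// cache entry; field names are illustrative, not the project's API.
type UnifiedContainerMetric struct {
	GPUUtilization float64
	GPUMemoryBytes uint64
}

// buildGPUMetrics fills gpuMetrics from the cache entry only when the
// container actually reports GPU activity (GPUUtilization > 0).
func buildGPUMetrics(cache map[string]UnifiedContainerMetric, key string) map[string]interface{} {
	gpuMetrics := make(map[string]interface{})
	if m, ok := cache[key]; ok && m.GPUUtilization > 0 {
		gpuMetrics["gpuUtilization"] = m.GPUUtilization
		gpuMetrics["gpuMemoryBytes"] = m.GPUMemoryBytes
	}
	return gpuMetrics
}

func main() {
	cache := map[string]UnifiedContainerMetric{
		"ns/pod/ctr": {GPUUtilization: 0.42, GPUMemoryBytes: 1 << 30},
	}
	fmt.Println(buildGPUMetrics(cache, "ns/pod/ctr"))
	fmt.Println(buildGPUMetrics(cache, "ns/pod/other")) // no entry: empty map
}
```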
…container aggregation

The previous approach summed container CPU/memory per node, which missed
system processes (kubelet, kernel, etc.). Now uses the kubelet
stats/summary node section which reports actual node-level usage
including all system overhead — matching what metrics-server provided.
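A minimal sketch of reading the node section of the kubelet `/stats/summary` payload; the structs mirror only the fields used here, and the sample JSON is illustrative, not captured from a real kubelet.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Summary is a minimal slice of the kubelet /stats/summary schema.
// The node section reports node-level usage including kubelet and
// kernel overhead, unlike a sum over per-container metrics.
type Summary struct {
	Node NodeStats `json:"node"`
}

type NodeStats struct {
	CPU    *CPUStats    `json:"cpu"`
	Memory *MemoryStats `json:"memory"`
}

type CPUStats struct {
	UsageNanoCores uint64 `json:"usageNanoCores"`
}

type MemoryStats struct {
	WorkingSetBytes uint64 `json:"workingSetBytes"`
}

// nodeUsage extracts node-level CPU millicores and working-set bytes.
func nodeUsage(raw []byte) (cpuMilli, memBytes uint64, err error) {
	var s Summary
	if err := json.Unmarshal(raw, &s); err != nil {
		return 0, 0, err
	}
	if s.Node.CPU != nil {
		cpuMilli = s.Node.CPU.UsageNanoCores / 1_000_000
	}
	if s.Node.Memory != nil {
		memBytes = s.Node.Memory.WorkingSetBytes
	}
	return cpuMilli, memBytes, nil
}

func main() {
	sample := []byte(`{"node":{"cpu":{"usageNanoCores":1500000000},"memory":{"workingSetBytes":2147483648}}}`)
	cpu, mem, err := nodeUsage(sample)
	if err != nil {
		panic(err)
	}
	fmt.Printf("node cpu=%dm mem=%d bytes\n", cpu, mem)
}
```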
…he key parsing

fmt.Sscanf with %s reads until whitespace, not until slash, so it
consumed the entire key into the first variable. strings.SplitN
correctly splits on '/' delimiter.
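The difference can be shown in a self-contained sketch (the function name is hypothetical): `%s` matches all non-whitespace characters, so the first verb swallows the slashes too, while `strings.SplitN` splits on the delimiter.

```go
package main

import (
	"fmt"
	"strings"
)

// splitCacheKey parses a "namespace/name/kind" cache key.
// fmt.Sscanf with %s cannot be used here: %s reads until whitespace,
// so the first %s would consume the entire key.
func splitCacheKey(key string) (ns, name, kind string, ok bool) {
	parts := strings.SplitN(key, "/", 3)
	if len(parts) != 3 {
		return "", "", "", false
	}
	return parts[0], parts[1], parts[2], true
}

func main() {
	// Broken approach: ns greedily consumes the whole key,
	// leaving name and kind empty.
	var ns, name, kind string
	fmt.Sscanf("default/my-deploy/Deployment", "%s/%s/%s", &ns, &name, &kind)
	fmt.Printf("Sscanf: ns=%q name=%q kind=%q\n", ns, name, kind)

	// Correct approach: each field is parsed separately.
	ns, name, kind, _ = splitCacheKey("default/my-deploy/Deployment")
	fmt.Printf("SplitN: ns=%q name=%q kind=%q\n", ns, name, kind)
}
```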
…routine leak

cache.Start(context.Background()) was never cancelled on shutdown.
Now uses the reconciler's ctx which is cancelled when the manager
stops or collectors restart.
…write on update

The installer_updater.yaml is used for upgrades, not fresh installs.
Including the Secret with a template placeholder would overwrite the
real cluster token already present in the cluster.
…DAKR backend

The DAKR backend templates {{.zxporter_namespace}} at serve-time based
on where zxporter is installed. Hardcoding devzero-system would break
updates for clusters using a different namespace.

Ordering: yq runs first (strips ConfigMap/Secret), then sed templates
the namespace — avoids yq mangling Go template syntax.
Bundled in dist/install.yaml so the existing curl | kubectl apply flow
automatically cleans up legacy Prometheus resources on upgrade:
- Dedicated ServiceAccount with least-privilege RBAC
- Namespaced Role: only deletes exact named resources in zxporter's namespace
- ClusterRole with resourceNames: only deletes zxporter's Prometheus RBAC
- Cannot affect Prometheus in other namespaces or other clusters
- Idempotent: --ignore-not-found on all deletes (safe for fresh installs)
- Self-cleaning: Job + RBAC auto-delete after 5 minutes
Same cleanup Job as the kubectl path but as a Helm hook:
- post-install: catches fresh installs on clusters with leftover kubectl-based Prometheus
- post-upgrade: catches upgrades from old Helm chart with Prometheus
- hook-delete-policy: before-hook-creation (re-run safe) + hook-succeeded (auto-cleanup)
- Dedicated SA with resourceNames-scoped RBAC (same safety as kubectl path)
Clusters with old zxporter had nodemon installed separately via
'helm install zxporter-nodemon'. The new zxporter bundles nodemon as
a subchart (named zxporter-zxporter-nodemon). Without cleanup, both
DaemonSets would run — duplicate nodemon pods on every node.

The migration job now also deletes the old standalone nodemon resources
(DaemonSet, ConfigMaps, ServiceAccount, ClusterRole/Binding) by exact
name. The orphaned Helm release Secret is harmless.
In the kubectl path, the new installer reuses the same resource names
(zxporter-nodemon) as the old standalone install. kubectl apply updates
them in-place — no cleanup needed, and deleting would remove what was
just applied.

Standalone nodemon cleanup stays in the Helm hook only, where the
subchart produces different names (zxporter-zxporter-nodemon).
GPURequestCount and GPULimitCount were never set after the Prometheus
removal — the old code read these from the pod spec during GPU metric
collection which was removed. Now reads nvidia.com/gpu resource
requests/limits directly from the container spec, separate from the
nodemon GPU usage metrics.

This fixes the dashboard showing '0 GPU Requests' for workloads that
have nvidia.com/gpu in their resource requests.
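A dependency-free sketch of the idea (the types below stand in for the Kubernetes corev1 resource types; names are illustrative): GPU request and limit counts are read directly from the container spec, independent of the nodemon usage metrics.

```go
package main

import "fmt"

// ResourceList is a hypothetical stand-in for corev1.ResourceList,
// mapping a resource name to an integer quantity.
type ResourceList map[string]int64

type Container struct {
	Name     string
	Requests ResourceList
	Limits   ResourceList
}

const nvidiaGPU = "nvidia.com/gpu"

// gpuCounts reads GPU request/limit counts from the container spec.
// A missing entry yields zero, matching map-lookup semantics.
func gpuCounts(c Container) (requests, limits int64) {
	return c.Requests[nvidiaGPU], c.Limits[nvidiaGPU]
}

func main() {
	c := Container{
		Name:     "trainer",
		Requests: ResourceList{nvidiaGPU: 2},
		Limits:   ResourceList{nvidiaGPU: 2},
	}
	req, lim := gpuCounts(c)
	fmt.Printf("requests=%d limits=%d\n", req, lim) // requests=2 limits=2
}
```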
Comment on lines +1 to +15
---
# Dedicated ServiceAccount for the one-time migration cleanup job.
# Scoped to only delete specific named resources left by previous zxporter installs.
apiVersion: v1
kind: ServiceAccount
metadata:
name: zxporter-prometheus-cleanup
namespace: devzero-system
labels:
app.kubernetes.io/name: zxporter-prometheus-cleanup
app.kubernetes.io/part-of: devzero-zxporter
---
# Namespaced Role: can only delete specific named Prometheus resources in the zxporter namespace.
# NOTE: Does NOT delete standalone nodemon — the kubectl install path reuses the same
# resource names (zxporter-nodemon), so kubectl apply updates them in-place.

💡 Quality: Orphaned RBAC resources in kubectl migration path

In config/migration/prometheus-cleanup-job.yaml, the Job self-cleans via ttlSecondsAfterFinished: 300, but the accompanying ServiceAccount, Role, RoleBinding, ClusterRole, and ClusterRoleBinding have no cleanup mechanism. After the Job completes and is garbage-collected, these RBAC resources remain in the cluster indefinitely as stale objects. The Helm variant correctly handles this with hook-delete-policy.

Since this is a one-time migration, the impact is low (a handful of no-op RBAC objects), but it's worth considering adding a final cleanup step in the Job script itself that deletes its own RBAC resources, or documenting that operators should manually remove them.

Suggested fix:

Add self-cleanup commands at the end of the Job script:

  # Self-cleanup: remove migration RBAC resources
  kubectl delete clusterrolebinding zxporter-prometheus-cleanup --ignore-not-found
  kubectl delete clusterrole zxporter-prometheus-cleanup --ignore-not-found
  kubectl delete rolebinding zxporter-prometheus-cleanup -n $NS --ignore-not-found
  kubectl delete role zxporter-prometheus-cleanup -n $NS --ignore-not-found
  kubectl delete serviceaccount zxporter-prometheus-cleanup -n $NS --ignore-not-found

Note: the SA deletion must come last or the final commands may fail. Alternatively, add delete permissions for these resources to the Role/ClusterRole.


serviceAccountName: zxporter-prometheus-cleanup
containers:
- name: cleanup
image: bitnami/kubectl:latest

💡 Security: Unpinned bitnami/kubectl:latest image in migration Jobs

Both migration job manifests use image: bitnami/kubectl:latest. This means:

  1. The image pulled is non-deterministic across environments and time.
  2. A compromised or incompatible future image could silently break or escalate the cleanup job.
  3. In air-gapped environments, :latest may not be available or may pull unexpectedly.

The Helm chart pins all other images to specific versions. Consider pinning this to a known-good version (e.g., bitnami/kubectl:1.31) for reproducibility and supply-chain safety.



gitar-bot Bot commented May 7, 2026

CI failed: golangci-lint reported several static-analysis violations, including high cyclomatic complexity, code-formatting issues, and unused declarations.

Overview

Multiple linting violations were identified during the CI process by golangci-lint, causing the build to fail. These issues range from code quality and complexity concerns to unused code fragments.

Failures

Static Analysis Violations (confidence: high)

  • Type: build
  • Affected jobs: 74765329445
  • Related to change: yes
  • Root cause: The codebase contains several linting errors: high cyclomatic complexity in UnifiedExporter.Collect, unformatted code (gofmt), missing constants for magic strings, and unused variables/types.
  • Suggested fix:
    1. Run gofmt -s -w . to address formatting issues.
    2. Refactor (*UnifiedExporter).Collect in internal/nodemon/unified_exporter.go to reduce cyclomatic complexity (currently 34, needs to be < 30).
    3. Replace magic string devzero-system with a constant in internal/collector/container_resource_collector.go.
    4. Pre-allocate the results slice in internal/nodemon/cadvisor_scraper.go to optimize memory usage.
    5. Remove unused type rateRecord and constant _ENV_ENABLE_NODEMON_METRICS.

Summary

  • Change-related failures: 1 (Linting errors blocking CI)
  • Infrastructure/flaky failures: 0
  • Recommended action: The developer should run golangci-lint locally to replicate these errors, apply the suggested refactorings, and ensure the code complies with project standards before pushing again.
Code Review: 👍 Approved with suggestions (5 resolved / 7 findings)

Removes Prometheus dependencies by transitioning to unified nodemon metrics, addressing GPU metric gaps, node CPU/memory reporting, and cache-related goroutine leaks. Consider pinning the bitnami/kubectl image version and cleaning up orphaned RBAC resources in the migration job.

💡 Quality: Orphaned RBAC resources in kubectl migration path

📄 config/migration/prometheus-cleanup-job.yaml:1-15

In config/migration/prometheus-cleanup-job.yaml, the Job self-cleans via ttlSecondsAfterFinished: 300, but the accompanying ServiceAccount, Role, RoleBinding, ClusterRole, and ClusterRoleBinding have no cleanup mechanism. After the Job completes and is garbage-collected, these RBAC resources remain in the cluster indefinitely as stale objects. The Helm variant correctly handles this with hook-delete-policy.

Since this is a one-time migration, the impact is low (a handful of no-op RBAC objects), but it's worth considering adding a final cleanup step in the Job script itself that deletes its own RBAC resources, or documenting that operators should manually remove them.

Suggested fix
Add self-cleanup commands at the end of the Job script:

  # Self-cleanup: remove migration RBAC resources
  kubectl delete clusterrolebinding zxporter-prometheus-cleanup --ignore-not-found
  kubectl delete clusterrole zxporter-prometheus-cleanup --ignore-not-found
  kubectl delete rolebinding zxporter-prometheus-cleanup -n $NS --ignore-not-found
  kubectl delete role zxporter-prometheus-cleanup -n $NS --ignore-not-found
  kubectl delete serviceaccount zxporter-prometheus-cleanup -n $NS --ignore-not-found

Note: the SA deletion must come last or the final commands may fail. Alternatively, add delete permissions for these resources to the Role/ClusterRole.
💡 Security: Unpinned bitnami/kubectl:latest image in migration Jobs

📄 config/migration/prometheus-cleanup-job.yaml:141 📄 helm-chart/zxporter/templates/prometheus-cleanup-job.yaml:163

Both migration job manifests use image: bitnami/kubectl:latest. This means:

  1. The image pulled is non-deterministic across environments and time.
  2. A compromised or incompatible future image could silently break or escalate the cleanup job.
  3. In air-gapped environments, :latest may not be available or may pull unexpectedly.

The Helm chart pins all other images to specific versions. Consider pinning this to a known-good version (e.g., bitnami/kubectl:1.31) for reproducibility and supply-chain safety.

✅ 5 resolved
Bug: GPU metrics always empty for containers after Prometheus removal

📄 internal/collector/container_resource_collector.go:325-327
The container resource collector sets gpuMetrics = make(map[string]interface{}) unconditionally at line 327, with a comment claiming GPU data flows through nodemonContainerMetricsCache. However, no code actually extracts GPU metrics (utilization, memory, power, temperature) from the cached UnifiedContainerMetric entries into the gpuMetrics map before passing it to processContainerMetrics. This is a regression: containers with GPUs will no longer report any GPU metrics.

Bug: Node CPU/memory now only counts containers, misses system usage

📄 internal/collector/node_collector.go:456-470 📄 internal/collector/node_collector.go:605-606
The node collector now builds NodeMetrics by summing container CPU/memory from nodemon (lines 478-509). Previously it used the metrics-server NodeMetricses() API which reports actual node-level usage including kubelet, system daemons, and kernel overhead. The resulting cpuUtilizationPercent and memoryUtilizationPercent (line 605-606) will systematically underreport true node utilization, potentially by 10-30% on busy nodes. This could cause autoscaling or alerting logic to make incorrect decisions.

Bug: fmt.Sscanf with %s cannot parse slash-delimited key

📄 internal/collector/historical_percentile_cache.go:107
In HistoricalPercentileCache.Refresh(), the fallback key parser uses fmt.Sscanf(key, "%s/%s/%s", &ns, &name, &kind). Since %s matches all non-whitespace characters (including /), the first %s will greedily consume the entire key (e.g., "default/my-deploy/Deployment") into ns, leaving name and kind empty. This means on the very next refresh cycle, stale workloads parsed from cache keys will produce malformed queries with empty name/kind.

Bug: Goroutine leak: cache.Start(context.Background()) never cancelled

📄 internal/controller/collectionpolicy_controller.go:1989
In setupMpaServer, a HistoricalPercentileCache is started in a goroutine with context.Background(). There's no stored cancel function, so this goroutine can never be stopped. If setupMpaServer is called multiple times (e.g., during policy reconciliation), or if the controller needs to shut down gracefully, these goroutines accumulate indefinitely. Each goroutine holds a ticker and periodically calls the DAKR API.

Bug: Node CPU/memory underreported — only sums container metrics

📄 internal/collector/node_collector.go:478-492
The node collector (lines 478-509) builds NodeMetrics by summing only container-level CPU/memory from FetchAllContainerMetrics. This excludes system-level resource consumption (kubelet, kernel, kube-proxy, etc.) that the previous metrics-server NodeMetricses() API included. Nodes will appear to have significantly lower utilization than reality, which could lead to incorrect autoscaling decisions or capacity planning.

Consider using the metrics-server's NodeMetricses() API (which zxporter still has a metricsClient for) or the node-level CPU/memory from kubelet stats/summary (which nodemon's UnifiedNodeMetric could provide via a /node/metrics endpoint that includes total node CPU/memory).


