
[biday prometheus] feat: remove Prometheus dependency, collect all metrics via nodemon. #376

Open
Parthiba-Hazra wants to merge 14 commits into main from ph/cut-the-prom-dep

Conversation


@Parthiba-Hazra Parthiba-Hazra commented May 5, 2026

…aemonSet

[Title]

📚 Description of Changes

Provide an overview of your changes and why they’re needed. Link to any related issues (e.g., "Fixes #123"). If your PR fixes a bug, resolves a feature request, or updates documentation, please explain how.

  • What Changed:
    (Describe the modifications, additions, or removals.)

  • Why This Change:
    (Explain the problem this PR addresses or the improvement it provides.)

  • Affected Components:
    (Which components does this change affect? Mark all that apply with an x.)

  • Compose

  • K8s

  • Other (please specify)

❓ Motivation and Context

Why is this change required? What problem does it solve?

  • Context:
    (Provide background information or link to related discussions/issues.)

  • Relevant Tasks/Issues:
    (e.g., Fixes: #GitHub Issue)

🔍 Types of Changes

Indicate which type of changes your code introduces (check all that apply):

  • BUGFIX: Non-breaking fix for an issue.
  • NEW FEATURE: Non-breaking addition of functionality.
  • BREAKING CHANGE: Fix or feature that causes existing functionality to not work as expected.
  • ENHANCEMENT: Improvement to existing functionality.
  • CHORE: Changes that do not affect production (e.g., documentation, build tooling, CI).

🔬 QA / Verification Steps

Describe the steps a reviewer should take to verify your changes:

  1. (Step one: e.g., "Run make test to verify all tests pass.")
  2. (Step two: e.g., "Deploy to a Kind cluster with make create-kind && make deploy.")
  3. (Additional steps as needed.)

✅ Global Checklist

Please check all boxes that apply:

  • I have read and followed the CONTRIBUTING guidelines.
  • My code follows the code style of this project.
  • I have updated the documentation as needed.
  • I have added tests that cover my changes.
  • All new and existing tests have passed locally.
  • I have run this code in a local environment to verify functionality.
  • I have considered the security implications of this change.

Summary by Gitar

  • Prometheus migration:
    • Added a prometheus-cleanup-job.yaml to both config/migration and Helm templates as a post-install/upgrade hook.
    • Implemented a Job using bitnami/kubectl to remove legacy Prometheus components while preserving current nodemon resources.
  • Build process updates:
    • Modified Makefile to include the Prometheus cleanup job in the generated DIST_INSTALL_BUNDLE.
  • GPU metrics collection:
    • Refactored internal/collector/container_resource_collector.go to source GPU requests and limits directly from the pod.Spec.
    • Integrated pod spec resources with existing nodemon usage metrics to provide comprehensive GPU dashboard reporting.

This will update automatically on new commits.

Comment thread internal/collector/container_resource_collector.go Outdated
Comment thread internal/collector/node_collector.go Outdated
Comment thread internal/collector/historical_percentile_cache.go Outdated
Comment thread internal/controller/collectionpolicy_controller.go Outdated
@Parthiba-Hazra Parthiba-Hazra force-pushed the ph/cut-the-prom-dep branch from 78b4b43 to d314993 on May 5, 2026 at 16:27
Comment thread internal/collector/node_collector.go Outdated
- Remove metrics-server and node-exporter install steps from k8s-compatibility-test
- Remove Prometheus debug step from k8s-compatibility-test
- Delete metrics-server-lifecycle-test workflow entirely (no longer needed)
- Build and push nodemon image alongside zxporter in CI
- Deploy nodemon subchart with zxporter in Helm path
- Pass IMG_NODEMON to make deploy for manifest path
- Set nodemonMetrics.enabled=true in CI Helm values
Comment on lines 824 to +573
@@ -1369,6 +562,15 @@ subjects:
namespace: devzero-system
---
apiVersion: v1
kind: Secret
metadata:
name: devzero-zxporter-token
namespace: devzero-system
stringData:
CLUSTER_TOKEN: "{{ .cluster_token }}"
type: Opaque
---
apiVersion: v1

@sandipanpanda sandipanpanda May 6, 2026


This might overwrite a Secret that is already present in the cluster during an update.

Comment thread dist/installer_updater.yaml Outdated
Comment on lines +824 to +17
name: '{{.zxporter_namespace}}'
name: devzero-system

This might break the namespace being passed based on which namespace zxporter is running in (devzero-zxporter or devzero-system).

GPU metrics from nodemon's unified endpoint were never extracted into
the gpuMetrics map, leaving it always empty. Now populates from
nodemonContainerMetricsCache when GPUUtilization > 0.
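A hedged sketch of the fix described above (type and field names are illustrative stand-ins, not the project's exact API): the map is populated from the cached entry only when GPU activity is reported, instead of being left empty unconditionally.

```go
package main

import "fmt"

// UnifiedContainerMetric is a hypothetical stand-in for a nodemon
// cache entry; field names are illustrative, not the project's API.
type UnifiedContainerMetric struct {
	GPUUtilization float64
	GPUMemoryBytes uint64
}

// buildGPUMetrics fills gpuMetrics from the cache entry only when the
// container actually reports GPU activity (GPUUtilization > 0).
func buildGPUMetrics(cache map[string]UnifiedContainerMetric, key string) map[string]interface{} {
	gpuMetrics := make(map[string]interface{})
	if m, ok := cache[key]; ok && m.GPUUtilization > 0 {
		gpuMetrics["gpuUtilization"] = m.GPUUtilization
		gpuMetrics["gpuMemoryBytes"] = m.GPUMemoryBytes
	}
	return gpuMetrics
}

func main() {
	cache := map[string]UnifiedContainerMetric{
		"ns/pod/ctr": {GPUUtilization: 0.42, GPUMemoryBytes: 1 << 30},
	}
	fmt.Println(buildGPUMetrics(cache, "ns/pod/ctr"))
	fmt.Println(buildGPUMetrics(cache, "ns/pod/other")) // no entry: empty map
}
```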
…container aggregation

The previous approach summed container CPU/memory per node, which missed
system processes (kubelet, kernel, etc.). Now uses the kubelet
stats/summary node section which reports actual node-level usage
including all system overhead — matching what metrics-server provided.
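A minimal sketch of reading the node section of the kubelet `/stats/summary` payload; the structs mirror only the fields used here, and the sample JSON is illustrative, not captured from a real kubelet.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Summary is a minimal slice of the kubelet /stats/summary schema.
// The node section reports node-level usage including kubelet and
// kernel overhead, unlike a sum over per-container metrics.
type Summary struct {
	Node NodeStats `json:"node"`
}

type NodeStats struct {
	CPU    *CPUStats    `json:"cpu"`
	Memory *MemoryStats `json:"memory"`
}

type CPUStats struct {
	UsageNanoCores uint64 `json:"usageNanoCores"`
}

type MemoryStats struct {
	WorkingSetBytes uint64 `json:"workingSetBytes"`
}

// nodeUsage extracts node-level CPU millicores and working-set bytes.
func nodeUsage(raw []byte) (cpuMilli, memBytes uint64, err error) {
	var s Summary
	if err := json.Unmarshal(raw, &s); err != nil {
		return 0, 0, err
	}
	if s.Node.CPU != nil {
		cpuMilli = s.Node.CPU.UsageNanoCores / 1_000_000
	}
	if s.Node.Memory != nil {
		memBytes = s.Node.Memory.WorkingSetBytes
	}
	return cpuMilli, memBytes, nil
}

func main() {
	sample := []byte(`{"node":{"cpu":{"usageNanoCores":1500000000},"memory":{"workingSetBytes":2147483648}}}`)
	cpu, mem, err := nodeUsage(sample)
	if err != nil {
		panic(err)
	}
	fmt.Printf("node cpu=%dm mem=%d bytes\n", cpu, mem)
}
```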
…he key parsing

fmt.Sscanf with %s reads until whitespace, not until slash, so it
consumed the entire key into the first variable. strings.SplitN
correctly splits on '/' delimiter.
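The difference can be shown in a self-contained sketch (the function name is hypothetical): `%s` matches all non-whitespace characters, so the first verb swallows the slashes too, while `strings.SplitN` splits on the delimiter.

```go
package main

import (
	"fmt"
	"strings"
)

// splitCacheKey parses a "namespace/name/kind" cache key.
// fmt.Sscanf with %s cannot be used here: %s reads until whitespace,
// so the first %s would consume the entire key.
func splitCacheKey(key string) (ns, name, kind string, ok bool) {
	parts := strings.SplitN(key, "/", 3)
	if len(parts) != 3 {
		return "", "", "", false
	}
	return parts[0], parts[1], parts[2], true
}

func main() {
	// Broken approach: ns greedily consumes the whole key,
	// leaving name and kind empty.
	var ns, name, kind string
	fmt.Sscanf("default/my-deploy/Deployment", "%s/%s/%s", &ns, &name, &kind)
	fmt.Printf("Sscanf: ns=%q name=%q kind=%q\n", ns, name, kind)

	// Correct approach: each field is parsed separately.
	ns, name, kind, _ = splitCacheKey("default/my-deploy/Deployment")
	fmt.Printf("SplitN: ns=%q name=%q kind=%q\n", ns, name, kind)
}
```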
…routine leak

cache.Start(context.Background()) was never cancelled on shutdown.
Now uses the reconciler's ctx which is cancelled when the manager
stops or collectors restart.
…write on update

The installer_updater.yaml is used for upgrades, not fresh installs.
Including the Secret with a template placeholder would overwrite the
real cluster token already present in the cluster.
…DAKR backend

The DAKR backend templates {{.zxporter_namespace}} at serve-time based
on where zxporter is installed. Hardcoding devzero-system would break
updates for clusters using a different namespace.

Ordering: yq runs first (strips ConfigMap/Secret), then sed templates
the namespace — avoids yq mangling Go template syntax.
Bundled in dist/install.yaml so the existing curl | kubectl apply flow
automatically cleans up legacy Prometheus resources on upgrade:
- Dedicated ServiceAccount with least-privilege RBAC
- Namespaced Role: only deletes exact named resources in zxporter's namespace
- ClusterRole with resourceNames: only deletes zxporter's Prometheus RBAC
- Cannot affect Prometheus in other namespaces or other clusters
- Idempotent: --ignore-not-found on all deletes (safe for fresh installs)
- Self-cleaning: Job + RBAC auto-delete after 5 minutes
Same cleanup Job as the kubectl path but as a Helm hook:
- post-install: catches fresh installs on clusters with leftover kubectl-based Prometheus
- post-upgrade: catches upgrades from old Helm chart with Prometheus
- hook-delete-policy: before-hook-creation (re-run safe) + hook-succeeded (auto-cleanup)
- Dedicated SA with resourceNames-scoped RBAC (same safety as kubectl path)
Clusters with old zxporter had nodemon installed separately via
'helm install zxporter-nodemon'. The new zxporter bundles nodemon as
a subchart (named zxporter-zxporter-nodemon). Without cleanup, both
DaemonSets would run — duplicate nodemon pods on every node.

The migration job now also deletes the old standalone nodemon resources
(DaemonSet, ConfigMaps, ServiceAccount, ClusterRole/Binding) by exact
name. The orphaned Helm release Secret is harmless.
In the kubectl path, the new installer reuses the same resource names
(zxporter-nodemon) as the old standalone install. kubectl apply updates
them in-place — no cleanup needed, and deleting would remove what was
just applied.

Standalone nodemon cleanup stays in the Helm hook only, where the
subchart produces different names (zxporter-zxporter-nodemon).
GPURequestCount and GPULimitCount were never set after the Prometheus
removal — the old code read these from the pod spec during GPU metric
collection which was removed. Now reads nvidia.com/gpu resource
requests/limits directly from the container spec, separate from the
nodemon GPU usage metrics.

This fixes the dashboard showing '0 GPU Requests' for workloads that
have nvidia.com/gpu in their resource requests.
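A dependency-free sketch of the idea (the types below stand in for the Kubernetes corev1 resource types; names are illustrative): GPU request and limit counts are read directly from the container spec, independent of the nodemon usage metrics.

```go
package main

import "fmt"

// ResourceList is a hypothetical stand-in for corev1.ResourceList,
// mapping a resource name to an integer quantity.
type ResourceList map[string]int64

type Container struct {
	Name     string
	Requests ResourceList
	Limits   ResourceList
}

const nvidiaGPU = "nvidia.com/gpu"

// gpuCounts reads GPU request/limit counts from the container spec.
// A missing entry yields zero, matching map-lookup semantics.
func gpuCounts(c Container) (requests, limits int64) {
	return c.Requests[nvidiaGPU], c.Limits[nvidiaGPU]
}

func main() {
	c := Container{
		Name:     "trainer",
		Requests: ResourceList{nvidiaGPU: 2},
		Limits:   ResourceList{nvidiaGPU: 2},
	}
	req, lim := gpuCounts(c)
	fmt.Printf("requests=%d limits=%d\n", req, lim) // requests=2 limits=2
}
```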
Comment on lines +1 to +15
---
# Dedicated ServiceAccount for the one-time migration cleanup job.
# Scoped to only delete specific named resources left by previous zxporter installs.
apiVersion: v1
kind: ServiceAccount
metadata:
name: zxporter-prometheus-cleanup
namespace: devzero-system
labels:
app.kubernetes.io/name: zxporter-prometheus-cleanup
app.kubernetes.io/part-of: devzero-zxporter
---
# Namespaced Role: can only delete specific named Prometheus resources in the zxporter namespace.
# NOTE: Does NOT delete standalone nodemon — the kubectl install path reuses the same
# resource names (zxporter-nodemon), so kubectl apply updates them in-place.

💡 Quality: Orphaned RBAC resources in kubectl migration path

In config/migration/prometheus-cleanup-job.yaml, the Job self-cleans via ttlSecondsAfterFinished: 300, but the accompanying ServiceAccount, Role, RoleBinding, ClusterRole, and ClusterRoleBinding have no cleanup mechanism. After the Job completes and is garbage-collected, these RBAC resources remain in the cluster indefinitely as stale objects. The Helm variant correctly handles this with hook-delete-policy.

Since this is a one-time migration, the impact is low (a handful of no-op RBAC objects), but it's worth considering adding a final cleanup step in the Job script itself that deletes its own RBAC resources, or documenting that operators should manually remove them.

Suggested fix:

Add self-cleanup commands at the end of the Job script:

  # Self-cleanup: remove migration RBAC resources
  kubectl delete clusterrolebinding zxporter-prometheus-cleanup --ignore-not-found
  kubectl delete clusterrole zxporter-prometheus-cleanup --ignore-not-found
  kubectl delete rolebinding zxporter-prometheus-cleanup -n $NS --ignore-not-found
  kubectl delete role zxporter-prometheus-cleanup -n $NS --ignore-not-found
  kubectl delete serviceaccount zxporter-prometheus-cleanup -n $NS --ignore-not-found

Note: the SA deletion must come last or the final commands may fail. Alternatively, add delete permissions for these resources to the Role/ClusterRole.


serviceAccountName: zxporter-prometheus-cleanup
containers:
- name: cleanup
image: bitnami/kubectl:latest

💡 Security: Unpinned bitnami/kubectl:latest image in migration Jobs

Both migration job manifests use image: bitnami/kubectl:latest. This means:

  1. The image pulled is non-deterministic across environments and time.
  2. A compromised or incompatible future image could silently break or escalate the cleanup job.
  3. In air-gapped environments, :latest may not be available or may pull unexpectedly.

The Helm chart pins all other images to specific versions. Consider pinning this to a known-good version (e.g., bitnami/kubectl:1.31) for reproducibility and supply-chain safety.



gitar-bot Bot commented May 7, 2026

CI failed: golangci-lint reported several static-analysis violations, including high cyclomatic complexity, code-formatting issues, and unused declarations.

Overview

Multiple linting violations were identified during the CI process by golangci-lint, causing the build to fail. These issues range from code quality and complexity concerns to unused code fragments.

Failures

Static Analysis Violations (confidence: high)

  • Type: build
  • Affected jobs: 74765329445
  • Related to change: yes
  • Root cause: The codebase contains several linting errors: high cyclomatic complexity in UnifiedExporter.Collect, unformatted code (gofmt), missing constants for magic strings, and unused variables/types.
  • Suggested fix:
    1. Run gofmt -s -w . to address formatting issues.
    2. Refactor (*UnifiedExporter).Collect in internal/nodemon/unified_exporter.go to reduce cyclomatic complexity (currently 34, needs to be < 30).
    3. Replace magic string devzero-system with a constant in internal/collector/container_resource_collector.go.
    4. Pre-allocate the results slice in internal/nodemon/cadvisor_scraper.go to optimize memory usage.
    5. Remove unused type rateRecord and constant _ENV_ENABLE_NODEMON_METRICS.

Summary

  • Change-related failures: 1 (Linting errors blocking CI)
  • Infrastructure/flaky failures: 0
  • Recommended action: The developer should run golangci-lint locally to replicate these errors, apply the suggested refactorings, and ensure the code complies with project standards before pushing again.
Code Review: 👍 Approved with suggestions (5 resolved / 7 findings)

Removes Prometheus dependencies by transitioning to unified nodemon metrics, addressing GPU metric gaps, node CPU/memory reporting, and cache-related goroutine leaks. Consider pinning the bitnami/kubectl image version and cleaning up orphaned RBAC resources in the migration job.

💡 Quality: Orphaned RBAC resources in kubectl migration path

📄 config/migration/prometheus-cleanup-job.yaml:1-15

In config/migration/prometheus-cleanup-job.yaml, the Job self-cleans via ttlSecondsAfterFinished: 300, but the accompanying ServiceAccount, Role, RoleBinding, ClusterRole, and ClusterRoleBinding have no cleanup mechanism. After the Job completes and is garbage-collected, these RBAC resources remain in the cluster indefinitely as stale objects. The Helm variant correctly handles this with hook-delete-policy.

Since this is a one-time migration, the impact is low (a handful of no-op RBAC objects), but it's worth considering adding a final cleanup step in the Job script itself that deletes its own RBAC resources, or documenting that operators should manually remove them.

Suggested fix
Add self-cleanup commands at the end of the Job script:

  # Self-cleanup: remove migration RBAC resources
  kubectl delete clusterrolebinding zxporter-prometheus-cleanup --ignore-not-found
  kubectl delete clusterrole zxporter-prometheus-cleanup --ignore-not-found
  kubectl delete rolebinding zxporter-prometheus-cleanup -n $NS --ignore-not-found
  kubectl delete role zxporter-prometheus-cleanup -n $NS --ignore-not-found
  kubectl delete serviceaccount zxporter-prometheus-cleanup -n $NS --ignore-not-found

Note: the SA deletion must come last or the final commands may fail. Alternatively, add delete permissions for these resources to the Role/ClusterRole.
💡 Security: Unpinned bitnami/kubectl:latest image in migration Jobs

📄 config/migration/prometheus-cleanup-job.yaml:141 📄 helm-chart/zxporter/templates/prometheus-cleanup-job.yaml:163

Both migration job manifests use image: bitnami/kubectl:latest. This means:

  1. The image pulled is non-deterministic across environments and time.
  2. A compromised or incompatible future image could silently break or escalate the cleanup job.
  3. In air-gapped environments, :latest may not be available or may pull unexpectedly.

The Helm chart pins all other images to specific versions. Consider pinning this to a known-good version (e.g., bitnami/kubectl:1.31) for reproducibility and supply-chain safety.

✅ 5 resolved
Bug: GPU metrics always empty for containers after Prometheus removal

📄 internal/collector/container_resource_collector.go:325-327
The container resource collector sets gpuMetrics = make(map[string]interface{}) unconditionally at line 327, with a comment claiming GPU data flows through nodemonContainerMetricsCache. However, no code actually extracts GPU metrics (utilization, memory, power, temperature) from the cached UnifiedContainerMetric entries into the gpuMetrics map before passing it to processContainerMetrics. This is a regression: containers with GPUs will no longer report any GPU metrics.

Bug: Node CPU/memory now only counts containers, misses system usage

📄 internal/collector/node_collector.go:456-470 📄 internal/collector/node_collector.go:605-606
The node collector now builds NodeMetrics by summing container CPU/memory from nodemon (lines 478-509). Previously it used the metrics-server NodeMetricses() API which reports actual node-level usage including kubelet, system daemons, and kernel overhead. The resulting cpuUtilizationPercent and memoryUtilizationPercent (line 605-606) will systematically underreport true node utilization, potentially by 10-30% on busy nodes. This could cause autoscaling or alerting logic to make incorrect decisions.

Bug: fmt.Sscanf with %s cannot parse slash-delimited key

📄 internal/collector/historical_percentile_cache.go:107
In HistoricalPercentileCache.Refresh(), the fallback key parser uses fmt.Sscanf(key, "%s/%s/%s", &ns, &name, &kind). Since %s matches all non-whitespace characters (including /), the first %s will greedily consume the entire key (e.g., "default/my-deploy/Deployment") into ns, leaving name and kind empty. This means on the very next refresh cycle, stale workloads parsed from cache keys will produce malformed queries with empty name/kind.

Bug: Goroutine leak: cache.Start(context.Background()) never cancelled

📄 internal/controller/collectionpolicy_controller.go:1989
In setupMpaServer, a HistoricalPercentileCache is started in a goroutine with context.Background(). There's no stored cancel function, so this goroutine can never be stopped. If setupMpaServer is called multiple times (e.g., during policy reconciliation), or if the controller needs to shut down gracefully, these goroutines accumulate indefinitely. Each goroutine holds a ticker and periodically calls the DAKR API.

Bug: Node CPU/memory underreported — only sums container metrics

📄 internal/collector/node_collector.go:478-492
The node collector (lines 478-509) builds NodeMetrics by summing only container-level CPU/memory from FetchAllContainerMetrics. This excludes system-level resource consumption (kubelet, kernel, kube-proxy, etc.) that the previous metrics-server NodeMetricses() API included. Nodes will appear to have significantly lower utilization than reality, which could lead to incorrect autoscaling decisions or capacity planning.

Consider using the metrics-server's NodeMetricses() API (which zxporter still has a metricsClient for) or the node-level CPU/memory from kubelet stats/summary (which nodemon's UnifiedNodeMetric could provide via a /node/metrics endpoint that includes total node CPU/memory).


