[biday prometheus] feat: remove Prometheus dependency, collect all metrics via nodemon. #376
Parthiba-Hazra wants to merge 14 commits.
Conversation
- Remove metrics-server and node-exporter install steps from k8s-compatibility-test
- Remove Prometheus debug step from k8s-compatibility-test
- Delete metrics-server-lifecycle-test workflow entirely (no longer needed)
- Build and push nodemon image alongside zxporter in CI
- Deploy nodemon subchart with zxporter in Helm path
- Pass IMG_NODEMON to make deploy for manifest path
- Set nodemonMetrics.enabled=true in CI Helm values
@@ -1369,6 +562,15 @@ subjects:
   namespace: devzero-system
 ---
+apiVersion: v1
+kind: Secret
+metadata:
+  name: devzero-zxporter-token
+  namespace: devzero-system
+stringData:
+  CLUSTER_TOKEN: "{{ .cluster_token }}"
+type: Opaque
+---
 apiVersion: v1
This might overwrite a Secret already present in the cluster during an update.
-name: '{{.zxporter_namespace}}'
+name: devzero-system
This might break the namespace being passed based on which namespace zxporter is running in (devzero-zxporter or devzero-system).
GPU metrics from nodemon's unified endpoint were never extracted into the gpuMetrics map, leaving it always empty. Now the map is populated from nodemonContainerMetricsCache when GPUUtilization > 0.
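For illustration, a minimal Go sketch of that extraction step, assuming simplified stand-in types (the real nodemonContainerMetricsCache entries and the gpuMetrics value type live in zxporter's collector package and may differ):

```go
package main

import "fmt"

// ContainerMetrics is a hypothetical stand-in for one cached nodemon sample.
type ContainerMetrics struct {
	PodUID         string
	ContainerName  string
	GPUUtilization float64
}

// populateGPUMetrics copies GPU samples out of the unified nodemon cache.
// This step was previously missing, so gpuMetrics always stayed empty.
func populateGPUMetrics(cache []ContainerMetrics) map[string]float64 {
	gpuMetrics := make(map[string]float64)
	for _, m := range cache {
		// Only record containers that actually report GPU usage.
		if m.GPUUtilization > 0 {
			gpuMetrics[m.PodUID+"/"+m.ContainerName] = m.GPUUtilization
		}
	}
	return gpuMetrics
}

func main() {
	cache := []ContainerMetrics{
		{PodUID: "abc", ContainerName: "train", GPUUtilization: 0.72},
		{PodUID: "def", ContainerName: "web", GPUUtilization: 0},
	}
	fmt.Println(populateGPUMetrics(cache)) // map[abc/train:0.72]
}
```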
…container aggregation
The previous approach summed container CPU/memory per node, which missed system processes (kubelet, kernel, etc.). Now uses the kubelet stats/summary node section, which reports actual node-level usage including all system overhead, matching what metrics-server provided.
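To make the difference concrete, a rough Go sketch of reading the node section of the kubelet stats summary; the struct mirrors only the JSON fields used here, and the URL and auth handling are simplified assumptions (the real collector would typically go through client-go or the API server proxy):

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

// summary mirrors just the node section of the kubelet /stats/summary response.
type summary struct {
	Node struct {
		CPU struct {
			UsageNanoCores *uint64 `json:"usageNanoCores"`
		} `json:"cpu"`
		Memory struct {
			WorkingSetBytes *uint64 `json:"workingSetBytes"`
		} `json:"memory"`
	} `json:"node"`
}

// nodeUsage reports whole-node CPU and memory, which includes kubelet and
// kernel overhead that summing per-container stats would miss.
func nodeUsage(statsURL string) (cpuNanoCores, memBytes uint64, err error) {
	resp, err := http.Get(statsURL)
	if err != nil {
		return 0, 0, err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return 0, 0, err
	}
	var s summary
	if err := json.Unmarshal(body, &s); err != nil {
		return 0, 0, err
	}
	if s.Node.CPU.UsageNanoCores != nil {
		cpuNanoCores = *s.Node.CPU.UsageNanoCores
	}
	if s.Node.Memory.WorkingSetBytes != nil {
		memBytes = *s.Node.Memory.WorkingSetBytes
	}
	return cpuNanoCores, memBytes, nil
}

func main() {
	// Assumes a local `kubectl proxy` in front of the kubelet endpoint.
	cpu, mem, err := nodeUsage("http://127.0.0.1:8001/api/v1/nodes/my-node/proxy/stats/summary")
	if err != nil {
		fmt.Println("fetch failed:", err)
		return
	}
	fmt.Printf("node CPU: %d nanocores, working set: %d bytes\n", cpu, mem)
}
```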
…he key parsing
fmt.Sscanf with %s reads until whitespace, not until a slash, so it consumed the entire key into the first variable. strings.SplitN correctly splits on the '/' delimiter.
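A runnable illustration of the bug and the fix (the ns/name key format is assumed for the example):

```go
package main

import (
	"fmt"
	"strings"
)

func main() {
	key := "devzero-system/zxporter-abc123"

	// Buggy: %s matches up to whitespace, so the whole key lands in ns,
	// the literal '/' never matches, and name stays empty (n == 1).
	var ns, name string
	n, _ := fmt.Sscanf(key, "%s/%s", &ns, &name)
	fmt.Println(n, ns, name) // 1 devzero-system/zxporter-abc123

	// Fixed: split on the first '/' only.
	if parts := strings.SplitN(key, "/", 2); len(parts) == 2 {
		fmt.Println(parts[0], parts[1]) // devzero-system zxporter-abc123
	}
}
```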
…routine leak
cache.Start(context.Background()) was never cancelled on shutdown. Now uses the reconciler's ctx, which is cancelled when the manager stops or collectors restart.
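A minimal sketch of the lifecycle fix, with a hypothetical cache type standing in for the real one; the point is that Start's goroutine exits once the caller's ctx is cancelled:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

type metricsCache struct{}

// Start runs a refresh loop until ctx is cancelled.
func (c *metricsCache) Start(ctx context.Context) {
	go func() {
		ticker := time.NewTicker(100 * time.Millisecond)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				fmt.Println("cache stopped:", ctx.Err())
				return
			case <-ticker.C:
				// refresh cached metrics here
			}
		}
	}()
}

func main() {
	// Before: cache.Start(context.Background()) leaked this goroutine forever.
	// After: use the reconciler's ctx, cancelled when the manager stops.
	ctx, cancel := context.WithCancel(context.Background())
	cache := &metricsCache{}
	cache.Start(ctx)

	time.Sleep(250 * time.Millisecond) // simulate the manager running
	cancel()                           // manager stops or collectors restart
	time.Sleep(50 * time.Millisecond)  // give the goroutine time to exit
}
```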
…write on update
The installer_updater.yaml is used for upgrades, not fresh installs. Including the Secret with a template placeholder would overwrite the real cluster token already present in the cluster.
…DAKR backend
The DAKR backend templates {{.zxporter_namespace}} at serve-time based
on where zxporter is installed. Hardcoding devzero-system would break
updates for clusters using a different namespace.
Ordering: yq runs first (strips ConfigMap/Secret), then sed templates the namespace; this avoids yq mangling the Go template syntax.
Bundled in dist/install.yaml so the existing curl | kubectl apply flow automatically cleans up legacy Prometheus resources on upgrade:
- Dedicated ServiceAccount with least-privilege RBAC
- Namespaced Role: only deletes exact named resources in zxporter's namespace
- ClusterRole with resourceNames: only deletes zxporter's Prometheus RBAC
- Cannot affect Prometheus in other namespaces or other clusters
- Idempotent: --ignore-not-found on all deletes (safe for fresh installs)
- Self-cleaning: Job + RBAC auto-delete after 5 minutes
Same cleanup Job as the kubectl path but as a Helm hook:
- post-install: catches fresh installs on clusters with leftover kubectl-based Prometheus
- post-upgrade: catches upgrades from old Helm chart with Prometheus
- hook-delete-policy: before-hook-creation (re-run safe) + hook-succeeded (auto-cleanup)
- Dedicated SA with resourceNames-scoped RBAC (same safety as kubectl path)
Clusters with old zxporter had nodemon installed separately via 'helm install zxporter-nodemon'. The new zxporter bundles nodemon as a subchart (named zxporter-zxporter-nodemon). Without cleanup, both DaemonSets would run — duplicate nodemon pods on every node. The migration job now also deletes the old standalone nodemon resources (DaemonSet, ConfigMaps, ServiceAccount, ClusterRole/Binding) by exact name. The orphaned Helm release Secret is harmless.
In the kubectl path, the new installer reuses the same resource names (zxporter-nodemon) as the old standalone install. kubectl apply updates them in-place — no cleanup needed, and deleting would remove what was just applied. Standalone nodemon cleanup stays in the Helm hook only, where the subchart produces different names (zxporter-zxporter-nodemon).
GPURequestCount and GPULimitCount were never set after the Prometheus removal: the old code read these from the pod spec during GPU metric collection, which was removed. Now reads nvidia.com/gpu resource requests/limits directly from the container spec, separate from the nodemon GPU usage metrics. This fixes the dashboard showing '0 GPU Requests' for workloads that have nvidia.com/gpu in their resource requests.
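A sketch of the spec-side counting using the upstream k8s.io/api types; the surrounding function and field names in the actual collector may differ:

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

const nvidiaGPU = corev1.ResourceName("nvidia.com/gpu")

// gpuCounts reads GPU request/limit counts straight from a container spec,
// independent of the nodemon usage metrics.
func gpuCounts(c corev1.Container) (requests, limits int64) {
	if q, ok := c.Resources.Requests[nvidiaGPU]; ok {
		requests = q.Value()
	}
	if q, ok := c.Resources.Limits[nvidiaGPU]; ok {
		limits = q.Value()
	}
	return requests, limits
}

func main() {
	c := corev1.Container{
		Name: "train",
		Resources: corev1.ResourceRequirements{
			Requests: corev1.ResourceList{nvidiaGPU: resource.MustParse("1")},
			Limits:   corev1.ResourceList{nvidiaGPU: resource.MustParse("2")},
		},
	}
	req, lim := gpuCounts(c)
	fmt.Printf("GPURequestCount=%d GPULimitCount=%d\n", req, lim)
}
```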
---
# Dedicated ServiceAccount for the one-time migration cleanup job.
# Scoped to only delete specific named resources left by previous zxporter installs.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: zxporter-prometheus-cleanup
  namespace: devzero-system
  labels:
    app.kubernetes.io/name: zxporter-prometheus-cleanup
    app.kubernetes.io/part-of: devzero-zxporter
---
# Namespaced Role: can only delete specific named Prometheus resources in the zxporter namespace.
# NOTE: Does NOT delete standalone nodemon — the kubectl install path reuses the same
# resource names (zxporter-nodemon), so kubectl apply updates them in-place.
💡 Quality: Orphaned RBAC resources in kubectl migration path
In config/migration/prometheus-cleanup-job.yaml, the Job self-cleans via ttlSecondsAfterFinished: 300, but the accompanying ServiceAccount, Role, RoleBinding, ClusterRole, and ClusterRoleBinding have no cleanup mechanism. After the Job completes and is garbage-collected, these RBAC resources remain in the cluster indefinitely as stale objects. The Helm variant correctly handles this with hook-delete-policy.
Since this is a one-time migration, the impact is low (a handful of no-op RBAC objects), but it's worth considering adding a final cleanup step in the Job script itself that deletes its own RBAC resources, or documenting that operators should manually remove them.
Suggested fix:
Add self-cleanup commands at the end of the Job script:
# Self-cleanup: remove migration RBAC resources
kubectl delete clusterrolebinding zxporter-prometheus-cleanup --ignore-not-found
kubectl delete clusterrole zxporter-prometheus-cleanup --ignore-not-found
kubectl delete rolebinding zxporter-prometheus-cleanup -n $NS --ignore-not-found
kubectl delete role zxporter-prometheus-cleanup -n $NS --ignore-not-found
kubectl delete serviceaccount zxporter-prometheus-cleanup -n $NS --ignore-not-found
Note: the SA deletion must come last or the final commands may fail. Alternatively, add delete permissions for these resources to the Role/ClusterRole.
      serviceAccountName: zxporter-prometheus-cleanup
      containers:
      - name: cleanup
        image: bitnami/kubectl:latest
💡 Security: Unpinned bitnami/kubectl:latest image in migration Jobs
Both migration job manifests use image: bitnami/kubectl:latest. This means:
- The image pulled is non-deterministic across environments and time.
- A compromised or incompatible future image could silently break or escalate the cleanup job.
- In air-gapped environments, :latest may not be available or may pull unexpectedly.
The Helm chart pins all other images to specific versions. Consider pinning this to a known-good version (e.g., bitnami/kubectl:1.31) for reproducibility and supply-chain safety.
CI failed: The build failed due to several static analysis violations detected by golangci-lint, including high cyclomatic complexity, code formatting issues, and unused declarations.
Overview
Multiple linting violations were identified during the CI process by golangci-lint.
Failures
Static Analysis Violations (confidence: high)
Summary
Code Review: 👍 Approved with suggestions (5 resolved / 7 findings)
Removes Prometheus dependencies by transitioning to unified nodemon metrics, addressing GPU metric gaps, node CPU/memory reporting, and cache-related goroutine leaks. Consider pinning the bitnami/kubectl image used by the migration Jobs.
Findings (detailed in the review comments above):
- 💡 Quality: Orphaned RBAC resources in kubectl migration path (📄 config/migration/prometheus-cleanup-job.yaml:1-15)
- 💡 Security: Unpinned bitnami/kubectl:latest image in migration Jobs
…aemonSet
[Title]
📚 Description of Changes
Provide an overview of your changes and why they’re needed. Link to any related issues (e.g., "Fixes #123"). If your PR fixes a bug, resolves a feature request, or updates documentation, please explain how.
What Changed:
(Describe the modifications, additions, or removals.)
Why This Change:
(Explain the problem this PR addresses or the improvement it provides.)
Affected Components:
(Which component does this change affect? - put x for all components)
- [ ] Compose
- [ ] K8s
- [ ] Other (please specify)
❓ Motivation and Context
Why is this change required? What problem does it solve?
Context:
(Provide background information or link to related discussions/issues.)
Relevant Tasks/Issues:
(e.g., Fixes: #GitHub Issue)
🔍 Types of Changes
Indicate which type of changes your code introduces (check all that apply):
🔬 QA / Verification Steps
Describe the steps a reviewer should take to verify your changes:
(e.g., "make test to verify all tests pass.")
(e.g., "make create-kind && make deploy.")
✅ Global Checklist
Please check all boxes that apply:
Summary by Gitar
- prometheus-cleanup-job.yaml added to both config/migration and the Helm templates as a post-install/upgrade hook.
- Job using bitnami/kubectl to remove legacy Prometheus components while preserving current nodemon resources.
- Makefile updated to include the Prometheus cleanup job in the generated DIST_INSTALL_BUNDLE.
- internal/collector/container_resource_collector.go updated to source GPU requests and limits directly from the pod.Spec, combined with nodemon usage metrics to provide comprehensive GPU dashboard reporting.

This will update automatically on new commits.