Closed
36 commits
cdfd297
docs: add design spec for removing Prometheus dependency
Parthiba-Hazra Apr 28, 2026
0b4a70c
Merge branch 'main' into ph/cut-the-prom-dep
Parthiba-Hazra Apr 29, 2026
3a4eef5
test: pin ContainerResourceCollector Prometheus transformation logic
Parthiba-Hazra Apr 29, 2026
390c073
test: pin NodeCollector and PVCMetricsCollector Prometheus transforma…
Parthiba-Hazra Apr 29, 2026
f382516
feat(nodemon): add RateCalculator for cAdvisor counter-to-rate conver…
Parthiba-Hazra Apr 29, 2026
36a955c
feat(nodemon): add kubelet stats/summary poller
Parthiba-Hazra Apr 29, 2026
f4348b4
feat(nodemon): add cAdvisor scraper with rate computation for throttl…
Parthiba-Hazra Apr 29, 2026
aba6f2a
feat(nodemon): add unified /v2/container/metrics, /node/metrics, /pvc…
Parthiba-Hazra Apr 29, 2026
796b467
feat: extend NodemonClient with unified container, node, and PVC metr…
Parthiba-Hazra Apr 29, 2026
6971013
feat(helm): add nodes/proxy RBAC to nodemon for kubelet endpoint access
Parthiba-Hazra Apr 29, 2026
d35e5b7
feat: add ENABLE_NODEMON_METRICS feature flag
Parthiba-Hazra Apr 29, 2026
8f2308c
refactor: extract HistoricalPercentileProvider interface for MPA server
Parthiba-Hazra Apr 29, 2026
28f6625
feat: add HistoricalPercentileCache backed by DAKR control plane
Parthiba-Hazra Apr 29, 2026
9637287
feat: wire nodemon path into ContainerResourceCollector behind featur…
Parthiba-Hazra Apr 29, 2026
aeff2cf
feat: wire HistoricalPercentileCache integration point into MPA serve…
Parthiba-Hazra Apr 29, 2026
3d3b701
feat: wire nodemon path into NodeCollector and PVCMetricsCollector be…
Parthiba-Hazra Apr 29, 2026
5787c69
test(nodemon): add integration test for full metrics flow
Parthiba-Hazra Apr 29, 2026
a1ffdf2
feat: implement PercentileFetcher on DakrClient for DAKR control plan…
Parthiba-Hazra Apr 29, 2026
82afdf5
feat(helm): make Prometheus components conditional, skip when nodemon…
Parthiba-Hazra Apr 29, 2026
b391b0e
feat(nodemon): wire unified endpoints into main.go with stats poller …
Parthiba-Hazra Apr 30, 2026
a642761
fix: skip Prometheus init when nodemon metrics are enabled
Parthiba-Hazra Apr 30, 2026
6cc043d
fix: add DISABLE_GPU_METRICS to Helm configmap and skip legacy GPU fe…
Parthiba-Hazra Apr 30, 2026
6e021f4
feat: get CPU/memory from nodemon instead of metrics-server when enabled
Parthiba-Hazra Apr 30, 2026
99f9fc7
feat: skip metrics-server installation when nodemon metrics enabled
Parthiba-Hazra Apr 30, 2026
d218d6e
fix: read ENABLE_NODEMON_METRICS from ConfigMap file mount in entrypoint
Parthiba-Hazra Apr 30, 2026
50f9f42
fix: initialize nodemon client when useNodemon=true regardless of GPU…
Parthiba-Hazra Apr 30, 2026
a155c13
fix: build node metrics list from informer when nodemon enabled
Parthiba-Hazra May 1, 2026
2fd2c14
fix: aggregate container CPU/memory from nodemon for node-level metrics
Parthiba-Hazra May 1, 2026
af767f4
refactor: simplify collectors by removing dead Prometheus fallbacks w…
Parthiba-Hazra May 1, 2026
712320e
refactor: remove all Prometheus query code, nodemon is the only data …
Parthiba-Hazra May 1, 2026
917f012
fix: relax IsAvailable to not require nodemon pods at startup
Parthiba-Hazra May 1, 2026
4782d09
fix: initialize nodemon client in constructor so IsAvailable works be…
Parthiba-Hazra May 1, 2026
d12a4d7
fix: compute network byte rates from cumulative stats/summary counters
Parthiba-Hazra May 1, 2026
2077294
fix: populate node network/disk metrics from cAdvisor and stats/summary
Parthiba-Hazra May 1, 2026
7ced44b
refactor: remove Prometheus from build system, unify nodemon as subchart
Parthiba-Hazra May 1, 2026
d9727ad
fix(helm): add namespace to nodemon ConfigMap templates
Parthiba-Hazra May 1, 2026
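Several commits above (the RateCalculator, and the network byte rates computed from cumulative stats/summary counters) depend on turning a monotonically increasing counter into a per-second rate. A minimal Go sketch of that idea follows, assuming two timestamped readings per counter. The `Sample` and `RateCalculator` shapes here are hypothetical illustrations and are not taken from the PR's code.

```go
package main

import "fmt"

// Sample is one reading of a cumulative counter (hypothetical type for illustration).
type Sample struct {
	Value    uint64 // cumulative counter value, e.g. received bytes
	UnixNano int64  // time the reading was taken, in nanoseconds
}

// ratePerSecond returns the average per-second increase between prev and cur.
// ok is false when no time has elapsed or the counter went backwards
// (for example after a container restart resets it).
func ratePerSecond(prev, cur Sample) (rate float64, ok bool) {
	elapsed := float64(cur.UnixNano-prev.UnixNano) / 1e9
	if elapsed <= 0 || cur.Value < prev.Value {
		return 0, false
	}
	return float64(cur.Value-prev.Value) / elapsed, true
}

// RateCalculator remembers the last sample per key so that each new
// reading can be converted into a rate (hypothetical sketch).
type RateCalculator struct {
	last map[string]Sample
}

// NewRateCalculator returns an empty calculator.
func NewRateCalculator() *RateCalculator {
	return &RateCalculator{last: make(map[string]Sample)}
}

// Observe stores cur for key and returns the rate since the previous
// sample. The first observation of a key never yields a rate.
func (r *RateCalculator) Observe(key string, cur Sample) (float64, bool) {
	prev, seen := r.last[key]
	r.last[key] = cur
	if !seen {
		return 0, false
	}
	return ratePerSecond(prev, cur)
}

func main() {
	rc := NewRateCalculator()
	rc.Observe("pod-a/rx_bytes", Sample{Value: 1000, UnixNano: 0})
	rate, ok := rc.Observe("pod-a/rx_bytes", Sample{Value: 4000, UnixNano: 30_000_000_000})
	fmt.Println(rate, ok)
}
```

Returning `ok == false` on a reset, rather than a negative or zero rate, lets the caller skip the data point instead of reporting a misleading value.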
1 change: 0 additions & 1 deletion Dockerfile
@@ -60,7 +60,6 @@ WORKDIR /
COPY --from=builder /workspace/manager .
USER 65532:65532

COPY ./dist/metrics-server.yaml /metrics-server.yaml
COPY ./entrypoint.sh /entrypoint.sh

ENTRYPOINT ["/entrypoint.sh", "/manager"]
92 changes: 11 additions & 81 deletions Makefile
@@ -57,8 +57,6 @@ TESTSERVER_IMG ?= ttl.sh/zxporter-testserver:latest
STRESS_IMG ?= ttl.sh/zxporter-stress:latest
# DAKR URL to use for deployment
DAKR_URL ?= https://dakr.devzero.io
# PROMETHEUS URL for metrics collection
PROMETHEUS_URL ?= http://prometheus-dz-prometheus-server.$(DEVZERO_MONITORING_NAMESPACE).svc.cluster.local:80
# TARGET_NAMESPACES for limiting collection to specific namespaces (comma-separated)
TARGET_NAMESPACES ?=
# COLLECTION_FILE is used to control the collectionpolicies.
@@ -70,19 +68,13 @@ ENV_CONFIGMAP_FILE ?= config/manager/env_configmap.yaml
CLUSTER_TOKEN ?=

# Monitoring resources
PROMETHEUS_CHART_VERSION ?= 27.20.0
DEVZERO_MONITORING_NAMESPACE ?= devzero-system
NODE_EXPORTER_CHART_VERSION ?= 4.47.0
METRICS_SERVER_CHART_VERSION ?= 3.12.2

# DIST_INSTALL_BUNDLE is the final complete manifest
DIST_DIR ?= dist
DIST_INSTALL_BUNDLE ?= $(DIST_DIR)/install.yaml
DIST_BACKEND_INSTALL_BUNDLE ?= $(DIST_DIR)/backend-install.yaml
DIST_ZXPORTER_BUNDLE ?= $(DIST_DIR)/zxporter.yaml
DIST_PROMETHEUS_BUNDLE ?= $(DIST_DIR)/prometheus.yaml
DIST_NODE_EXPORTER_BUNDLE ?= $(DIST_DIR)/node-exporter.yaml
METRICS_SERVER ?= $(DIST_DIR)/metrics-server.yaml

# ENVTEST_K8S_VERSION refers to the version of kubebuilder assets to be downloaded by envtest binary.
ENVTEST_K8S_VERSION = 1.31.0
@@ -208,20 +200,6 @@ run: manifests generate fmt vet ## Run a controller from your host.
# More info: https://docs.docker.com/develop/develop-images/build_enhancements/
.PHONY: docker-build
docker-build: helm ## Build docker image with the manager.
@echo "[INFO] Adding Metrics Server repo"
@$(HELM) repo add metrics-server https://kubernetes-sigs.github.io/metrics-server/ >> /dev/null || true
@echo "[INFO] Fetching Metrics Server repo data"
@$(HELM) repo update metrics-server >> /dev/null

@echo "[INFO] Generate Metrics Server manifest"
@$(HELM) template metrics-server metrics-server/metrics-server \
--version $(METRICS_SERVER_CHART_VERSION) \
--namespace devzero-system \
--set args="{--kubelet-insecure-tls}" \
--set nameOverride="dz-metrics-server" \
--set fullnameOverride="dz-metrics-server" \
> $(METRICS_SERVER)

@echo "[INFO] For debug -> $(GO_VERSION), major $(MAJOR), minor $(MINOR), patch $(PATCH)"
$(CONTAINER_TOOL) build --load \
--build-arg MAJOR=$(MAJOR) \
@@ -292,29 +270,6 @@ docker-buildx: ## Build and push docker image for the manager for cross-platform
- $(CONTAINER_TOOL) buildx rm zxporter-builder
rm Dockerfile.cross

.PHONY: generate-monitoring-manifests
generate-monitoring-manifests: helm ## Generate monitoring manifests for Prometheus and Node Exporter.
@echo "[INFO] Adding Prometheus repo"
@$(HELM) repo add prometheus-community https://prometheus-community.github.io/helm-charts >> /dev/null || true
@echo "[INFO] Fetching prometheus repo data"
@$(HELM) repo update prometheus-community >> /dev/null

@echo "[INFO] Generate prometheus manifest"
@$(HELM) template prometheus prometheus-community/prometheus \
--version $(PROMETHEUS_CHART_VERSION) \
--namespace $(DEVZERO_MONITORING_NAMESPACE) \
--create-namespace \
--values config/prometheus/hack.prometheus.values.yaml \
> $(DIST_PROMETHEUS_BUNDLE)

@echo "[INFO] Generate Node Exporter manifest"
@$(HELM) template node-exporter prometheus-community/prometheus-node-exporter \
--version $(NODE_EXPORTER_CHART_VERSION) \
--namespace $(DEVZERO_MONITORING_NAMESPACE) \
--create-namespace \
--values config/prometheus/hack.node-exporter.values.yaml \
> $(DIST_NODE_EXPORTER_BUNDLE)

.PHONY: final-installer
final-installer:
@cp dist/install.yaml $(DIST_BACKEND_INSTALL_BUNDLE)
@@ -334,21 +289,11 @@ installer-without-configmap:
@$(YQ) -i 'select(.kind != "ConfigMap" or .metadata.name != "devzero-zxporter-env-config")' $(DIST_DIR)/installer_updater.yaml

.PHONY: build-installer
build-installer: manifests generate kustomize yq ## Generate a consolidated YAML with deployment.
build-installer: manifests generate kustomize yq helm ## Generate a consolidated YAML with deployment.
@mkdir -p $(DIST_DIR)

@echo "[INFO] Generating manifests for monitoring components..."
@$(MAKE) generate-monitoring-manifests
@echo "[INFO] Monitoring manifests generated."

@echo "[INFO] Generating installer bundle..."
@echo "## ATTN KUBERNETES ADMINS! Read this..." > $(DIST_INSTALL_BUNDLE)
@echo "# If prometheus-server is already installed, and you want to use that version," >> $(DIST_INSTALL_BUNDLE)
@echo "# comment out the section from \"START PROM SERVER\" to \"END PROM SERVER\" and update the \"prometheusURL\" variable." >> $(DIST_INSTALL_BUNDLE)
@echo -e "#" >> $(DIST_INSTALL_BUNDLE)
@echo "# If prometheus-node-exporter is already installed, and you want to use that version," >> $(DIST_INSTALL_BUNDLE)
@echo "# comment out the section from \"START PROM NODE EXPORTER\" to \"END PROM NODE EXPORTER\"" >> $(DIST_INSTALL_BUNDLE)
@echo -e "# \n" >> $(DIST_INSTALL_BUNDLE)
@echo "# ZXPorter installer bundle" > $(DIST_INSTALL_BUNDLE)

@echo "[INFO] Adding namespace to the main installer"
@echo "apiVersion: v1" >> $(DIST_INSTALL_BUNDLE)
@@ -359,35 +304,27 @@ build-installer: manifests generate kustomize yq ## Generate a consolidated YAML
@echo " app.kubernetes.io/name: $(DEVZERO_MONITORING_NAMESPACE)" >> $(DIST_INSTALL_BUNDLE)
@echo " name: $(DEVZERO_MONITORING_NAMESPACE)" >> $(DIST_INSTALL_BUNDLE)

@echo "[INFO] Append prometheus-server to the main installer"
@echo "# ----- START PROM SERVER -----" >> $(DIST_INSTALL_BUNDLE)
@cat $(DIST_PROMETHEUS_BUNDLE) >> $(DIST_INSTALL_BUNDLE)
@echo "# ----- END PROM SERVER -----" >> $(DIST_INSTALL_BUNDLE)

@echo "[INFO] Append prometheus-node-exporter to the main installer"
@echo "# ----- START PROM NODE EXPORTER -----" >> $(DIST_INSTALL_BUNDLE)
@cat $(DIST_NODE_EXPORTER_BUNDLE) >> $(DIST_INSTALL_BUNDLE)
@echo "# ----- END PROM NODE EXPORTER -----" >> $(DIST_INSTALL_BUNDLE)

# @echo "[INFO] Append Metrics Server to the main installer"
# @echo "# ----- START METRICS SERVER -----" >> $(DIST_INSTALL_BUNDLE)
# @cat $(METRICS_SERVER) >> $(DIST_INSTALL_BUNDLE)
# @echo "# ----- END METRICS SERVER -----" >> $(DIST_INSTALL_BUNDLE)
@echo "---" >> $(DIST_INSTALL_BUNDLE)

@echo "[INFO] Append zxporter-manager to the installer bundle"
@cd config/manager && $(KUSTOMIZE) edit set image controller=${IMG}

@echo "[INFO] Replacing env variables in configmap"
@$(YQ) e '.data.DAKR_URL = "$(DAKR_URL)"' -i $(ENV_CONFIGMAP_FILE)
@$(YQ) e '.data.PROMETHEUS_URL = "$(PROMETHEUS_URL)"' -i $(ENV_CONFIGMAP_FILE)
@$(YQ) e '.data.TARGET_NAMESPACES = "$(TARGET_NAMESPACES)"' -i $(ENV_CONFIGMAP_FILE)

@$(KUSTOMIZE) build config/default > $(DIST_ZXPORTER_BUNDLE)
@echo "[INFO] Patching cluster token into generated bundle"
@sed "s|CLUSTER_TOKEN: '{{ .cluster_token }}'|CLUSTER_TOKEN: \"$(CLUSTER_TOKEN)\"|g" $(DIST_ZXPORTER_BUNDLE) > $(DIST_ZXPORTER_BUNDLE).tmp && mv $(DIST_ZXPORTER_BUNDLE).tmp $(DIST_ZXPORTER_BUNDLE)
@cat $(DIST_ZXPORTER_BUNDLE) >> $(DIST_INSTALL_BUNDLE)

@echo "[INFO] Generate and append nodemon DaemonSet to installer"
@$(HELM) template zxporter-nodemon ./helm-chart/zxporter-nodemon \
--namespace $(DEVZERO_MONITORING_NAMESPACE) \
--set provider=other \
> $(DIST_DIR)/nodemon.yaml
@cat $(DIST_DIR)/nodemon.yaml >> $(DIST_INSTALL_BUNDLE)

@echo "[INFO] Building backend installer"
@$(MAKE) final-installer

@@ -398,7 +335,6 @@ build-env-configmap:
echo "" > $(DIST_INSTALL_BUNDLE)
# Copy and patch environment config
sed "s|\$$(DAKR_URL)|$(DAKR_URL)|g" $(ENV_CONFIGMAP_FILE) > temp.yaml && mv temp.yaml $(ENV_CONFIGMAP_FILE)
sed "s|\$$(PROMETHEUS_URL)|$(PROMETHEUS_URL)|g" $(ENV_CONFIGMAP_FILE) > temp.yaml && mv temp.yaml $(ENV_CONFIGMAP_FILE)
sed "s|\$$(TARGET_NAMESPACES)|$(TARGET_NAMESPACES)|g" $(ENV_CONFIGMAP_FILE) > temp.yaml && mv temp.yaml $(ENV_CONFIGMAP_FILE)
$(KUSTOMIZE) build config/default | \
yq eval 'select(.kind == "ConfigMap" and .metadata.name == "devzero-zxporter-env-config")' - >> $(DIST_INSTALL_BUNDLE)
@@ -454,7 +390,6 @@ helm-chart-install-minimal: helm-chart-build ## Install only zxporter without mo
--namespace devzero-system \
--create-namespace \
--set monitoring.enabled=false \
--set zxporter.prometheusUrl="$(PROMETHEUS_URL)" \
--wait

.PHONY: helm-chart-uninstall
@@ -505,13 +440,8 @@ deploy-env-configmap: DIST_INSTALL_BUNDLE=$(DIST_DIR)/env_configmap.yaml
deploy-env-configmap: build-env-configmap
cat $(DIST_INSTALL_BUNDLE) | $(KUBECTL) apply -f -

.PHONY: undeploy-monitoring
undeploy-monitoring: ## Undeploy monitoring components.
$(KUBECTL) delete --ignore-not-found=$(ignore-not-found) -f $(DIST_NODE_EXPORTER_BUNDLE) || true
$(KUBECTL) delete --ignore-not-found=$(ignore-not-found) -f $(DIST_PROMETHEUS_BUNDLE) || true

.PHONY: undeploy
undeploy: kustomize undeploy-monitoring ## Undeploy controller from the K8s cluster specified in ~/.kube/config. Call with ignore-not-found=true to ignore resource not found errors during deletion.
undeploy: kustomize ## Undeploy controller from the K8s cluster specified in ~/.kube/config. Call with ignore-not-found=true to ignore resource not found errors during deletion.
$(KUSTOMIZE) build config/default | $(KUBECTL) delete --ignore-not-found=$(ignore-not-found) -f -

##@ Dependencies
13 changes: 1 addition & 12 deletions api/v1/collectionpolicy_types.go
@@ -449,18 +449,7 @@ type Policies struct {
// If ClusterToken is not provided but PATToken is, the system will exchange it for a cluster token
PATToken string `json:"patToken,omitempty"`

// PrometheusURL is the URL of the Prometheus server to query for metrics
// If not provided, defaults to in-cluster Prometheus at "http://prometheus-service.monitoring.svc.cluster.local:8080"
// +optional
PrometheusURL string `json:"prometheusURL,omitempty"`

// DisableNetworkIOMetrics disables collection of container network and I/O metrics from Prometheus
// These metrics include network throughput, packet rates, and disk I/O operations
// Default is false, meaning metrics are collected by default
// +optional
DisableNetworkIOMetrics bool `json:"disableNetworkIOMetrics,omitempty"`

// DisableGpuMetrics disables collection of GPU metrics from Prometheus
// DisableGpuMetrics disables collection of GPU metrics
// These metrics include GPU utilization, memory usage, and temperature
// Default is false, meaning metrics are collected by default
// +optional
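Because `PrometheusURL` and `DisableNetworkIOMetrics` are dropped from `Policies`, previously serialized policies may still carry the `prometheusURL` and `disableNetworkIOMetrics` keys. The sketch below shows how Go's `encoding/json` treats such leftover keys, assuming the data is decoded with the standard library. The `policies` struct is a hypothetical two-field stand-in for the real type, and nothing here describes how the Kubernetes API server itself handles the removed CRD properties.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// policies is a hypothetical, trimmed stand-in for the Policies struct
// after this change; only two of its fields are reproduced here.
type policies struct {
	PATToken          string `json:"patToken,omitempty"`
	DisableGpuMetrics bool   `json:"disableGpuMetrics,omitempty"`
}

// decodeLenient uses json.Unmarshal, which silently ignores keys that
// have no matching struct field.
func decodeLenient(data []byte) (policies, error) {
	var p policies
	err := json.Unmarshal(data, &p)
	return p, err
}

// decodeStrict rejects input that contains keys unknown to the struct.
func decodeStrict(data []byte) (policies, error) {
	var p policies
	dec := json.NewDecoder(bytes.NewReader(data))
	dec.DisallowUnknownFields()
	err := dec.Decode(&p)
	return p, err
}

func main() {
	// A policy written before the change, still carrying removed keys.
	legacy := []byte(`{"patToken":"abc","prometheusURL":"http://prom:80","disableGpuMetrics":true}`)

	p, err := decodeLenient(legacy)
	fmt.Println(p.DisableGpuMetrics, err)

	_, err = decodeStrict(legacy)
	fmt.Println(err != nil)
}
```

With the lenient path the removed keys are simply dropped, so old payloads keep decoding; a strict decoder would surface them as errors, which is only useful if stale configuration should be flagged.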
32 changes: 29 additions & 3 deletions cmd/zxporter-nodemon/main.go
@@ -76,13 +76,39 @@ func main() {
)
mapper := nodemon.NewMapper(cfg.NodeName, workloadResolver, logger)

// Create exporter
// Create GPU exporter
exporter := nodemon.NewExporter(cfg, dynClient, scraper, mapper, logger)

// Create HTTP handler and server
containerMetricsHandler := nodemon.NewContainerMetricsHandler(exporter, logger)
// Create a K8s-authenticated HTTP client for kubelet API proxy access
k8sTransport, err := rest.TransportFor(kubeConfig)
if err != nil {
logger.Error(err, "Failed to create K8s transport")
os.Exit(1)
}
k8sHTTPClient := &http.Client{Transport: k8sTransport, Timeout: 15 * time.Second}

// Use the K8s API server proxy for kubelet access (same as Cortex pattern)
apiProxyBase := kubeConfig.Host + "/api/v1/nodes/" + cfg.NodeName + "/proxy"
statsPoller := nodemon.NewStatsPoller(apiProxyBase, k8sHTTPClient, logger)
cadvisorScraper := nodemon.NewCAdvisorScraper(apiProxyBase, k8sHTTPClient, logger)

// Create unified exporter that combines all data sources
unifiedExporter := nodemon.NewUnifiedExporter(statsPoller, cadvisorScraper, exporter, cfg.NodeName, logger)

// Start unified collection loop (every 30 seconds)
collectionCtx, collectionCancel := context.WithCancel(context.Background())
defer collectionCancel()
go unifiedExporter.StartCollectionLoop(collectionCtx, 30*time.Second)

// Create HTTP handlers
containerMetricsHandler := nodemon.NewContainerMetricsHandler(exporter, logger) // GPU-only (backward compat)
mux := nodemon.NewServerMux(containerMetricsHandler)

// Register unified endpoints
mux.Handle("/v2/container/metrics", nodemon.NewUnifiedContainerHandler(unifiedExporter, logger))
mux.Handle("/node/metrics", nodemon.NewNodeMetricsHandler(unifiedExporter, logger))
mux.Handle("/pvc/metrics", nodemon.NewPVCMetricsHandler(unifiedExporter, logger))

server := &http.Server{
Addr: fmt.Sprintf(":%d", cfg.HTTPListenPort),
Handler: mux,
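The hunk above reaches the kubelet through the API server's node proxy (`<host>/api/v1/nodes/<node>/proxy`) and then collects on a 30 second loop. The following Go sketch illustrates those two pieces in isolation, under stated assumptions: `kubeletProxyURL`, `parseSummary`, `nodeSummary`, and `runEvery` are hypothetical names, not the PR's `StatsPoller` or `StartCollectionLoop`, and `nodeSummary` covers only a few fields of the kubelet `/stats/summary` response.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"strings"
	"time"
)

// nodeSummary holds a small subset of the kubelet /stats/summary payload
// (hypothetical type; the real response has many more fields).
type nodeSummary struct {
	Node struct {
		NodeName string `json:"nodeName"`
		Network  struct {
			RxBytes uint64 `json:"rxBytes"`
			TxBytes uint64 `json:"txBytes"`
		} `json:"network"`
	} `json:"node"`
}

// kubeletProxyURL builds the API server proxy URL for a kubelet path
// such as "/stats/summary" or "/metrics/cadvisor".
func kubeletProxyURL(apiServerHost, nodeName, kubeletPath string) string {
	return strings.TrimSuffix(apiServerHost, "/") +
		"/api/v1/nodes/" + nodeName + "/proxy" + kubeletPath
}

// parseSummary decodes a stats/summary response body.
func parseSummary(body []byte) (nodeSummary, error) {
	var s nodeSummary
	err := json.Unmarshal(body, &s)
	return s, err
}

// runEvery calls collect once immediately and then on every tick
// until ctx is cancelled.
func runEvery(ctx context.Context, interval time.Duration, collect func()) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	collect()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			collect()
		}
	}
}

func main() {
	fmt.Println(kubeletProxyURL("https://kubernetes.default.svc", "node-a", "/stats/summary"))

	s, err := parseSummary([]byte(`{"node":{"nodeName":"node-a","network":{"rxBytes":1200,"txBytes":800}}}`))
	if err != nil {
		panic(err)
	}
	fmt.Println(s.Node.NodeName, s.Node.Network.RxBytes, s.Node.Network.TxBytes)

	// Run a short-lived loop to show cancellation; a real poller would
	// issue the HTTP request inside the callback.
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	runs := 0
	runEvery(ctx, 10*time.Millisecond, func() { runs++ })
	fmt.Println("collected at least once:", runs > 0)
}
```

Going through the API server proxy is what makes the `nodes/proxy` RBAC rule from the Helm commit necessary: the request is authorized by the API server rather than by the kubelet directly.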
13 changes: 1 addition & 12 deletions config/crd/bases/devzero.io_collectionpolicies.yaml
@@ -658,16 +658,10 @@ spec:
type: string
disableGpuMetrics:
description: |-
DisableGpuMetrics disables collection of GPU metrics from Prometheus
DisableGpuMetrics disables collection of GPU metrics
These metrics include GPU utilization, memory usage, and temperature
Default is false, meaning metrics are collected by default
type: boolean
disableNetworkIOMetrics:
description: |-
DisableNetworkIOMetrics disables collection of container network and I/O metrics from Prometheus
These metrics include network throughput, packet rates, and disk I/O operations
Default is false, meaning metrics are collected by default
type: boolean
disabledCollectors:
description: |-
DisabledCollectors is a list of collector types to completely disable
@@ -701,11 +695,6 @@ spec:
PATToken is the Personal Access Token used for automatic cluster token exchange
If ClusterToken is not provided but PATToken is, the system will exchange it for a cluster token
type: string
prometheusURL:
description: |-
PrometheusURL is the URL of the Prometheus server to query for metrics
If not provided, defaults to in-cluster Prometheus at "http://prometheus-service.monitoring.svc.cluster.local:8080"
type: string
watchedCRDs:
description: WatchedCRDs is a list of custom resource definitions
to explicitly watch
2 changes: 0 additions & 2 deletions config/manager/env_configmap.yaml
@@ -9,14 +9,12 @@ data:
# PAT_TOKEN: "{{ .pat_token }}" # Uncomment to use PAT token (recommended to use Secret instead)
KUBE_CONTEXT_NAME: '{{ .kube_context_name }}'
DAKR_URL: "https://dakr.devzero.io"
PROMETHEUS_URL: "http://prometheus-dz-prometheus-server.devzero-system.svc.cluster.local:80"
K8S_PROVIDER: "{{ .k8s_provider }}"
COLLECTION_FREQUENCY: ""
BUFFER_SIZE: ""
EXCLUDED_NAMESPACES: ""
EXCLUDED_NODES: ""
TARGET_NAMESPACES: ""
DISABLE_NETWORK_IO_METRICS: ""
MASK_SECRET_DATA: ""
NODE_METRICS_INTERVAL: ""
WATCHED_CRDS: ""
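Most values in this ConfigMap are empty strings, which suggests that an empty value means "use the default". A minimal Go sketch of that convention for a boolean flag follows. It is an assumption about the parsing behavior, not the project's actual loader; `boolSetting` is a hypothetical helper, and `DISABLE_GPU_METRICS` is used only because a commit in this PR mentions it.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// boolSetting parses a boolean setting and falls back to def when the
// value is empty or is not something strconv.ParseBool accepts.
func boolSetting(raw string, def bool) bool {
	v := strings.TrimSpace(raw)
	if v == "" {
		return def
	}
	parsed, err := strconv.ParseBool(v)
	if err != nil {
		return def
	}
	return parsed
}

func main() {
	// An unset variable yields "", so the default applies.
	fmt.Println(boolSetting(os.Getenv("DISABLE_GPU_METRICS"), false))
}
```

Falling back to the default on malformed input keeps a typo in the ConfigMap from changing behavior silently in the other direction; logging the rejected value would be a reasonable addition.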
24 changes: 0 additions & 24 deletions config/prometheus/hack.node-exporter.values.yaml

This file was deleted.
