
pvc metrics collector added #252

Merged
Rupam-It merged 7 commits into main from rupam/zxporter—PVCStorageUsageMetricsCollector on Jan 28, 2026


Conversation

Rupam-It (Contributor) commented Jan 23, 2026

pvc metrics collector added

📚 Description of Changes

In this PR, I extend the core functionality of the metrics collector by adding a new PVC metrics collector. zxporter can now send PVC metrics alongside container metrics!

  • What Changed:
    Added a new PVC collector that gathers storage usage metrics for PersistentVolumeClaims.

  • Why This Change:
    This makes the collector framework extensible to PVC metrics alongside the existing resource metrics.

coderabbitai (Bot) commented Jan 23, 2026

📝 Walkthrough


Adds a new PersistentVolumeClaimMetrics resource type, a strongly-typed PVC metrics snapshot, a Prometheus-backed PVC metrics collector with informer-driven listing and batching, controller wiring to register/start/stop the collector, and a local Helm values file.

Changes

Cohort / File(s) Summary
Resource Type Definition
internal/collector/interface.go
Added PersistentVolumeClaimMetrics to ResourceType, mapped to "pvc_metrics", and updated ProtoType() handling.
Data Model
internal/collector/pvc_metrics.go
Added PersistentVolumeClaimMetricsSnapshot struct (PVC identity, binding/storage metadata, usage metrics, stats availability; JSON-tagged). A rough sketch follows the table.
Collector Implementation
internal/collector/pvc_metrics_collector.go
New PersistentVolumeClaimMetricsCollector and PersistentVolumeClaimMetricsCollectorConfig; constructor, Start/Stop, collection loop, Prometheus queries for filesystem usage, per-PVC processing, batching/emission, helper functions, exclusions, and availability checks.
Controller Integration
internal/controller/collectionpolicy_controller.go
Register, construct, start/stop, and dynamically enable/disable the PersistentVolumeClaimMetrics collector; integrated into restart and dynamic-enable flows.
Formatting Only
internal/collector/pvc_collector.go
Minor composite-literal formatting change; no behavioral change.
Helm Values
helm-chart/zxporter/local_values.yaml
Added local Helm values file with ZXPorter defaults for deployment, monitoring, tokens, scheduling, and HA settings.
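
As a rough illustration of the "Data Model" row above, the snapshot might look like the following Go struct; the field names here are illustrative assumptions, not the actual definitions in internal/collector/pvc_metrics.go:

type PersistentVolumeClaimMetricsSnapshot struct {
	Namespace      string    `json:"namespace"`
	Name           string    `json:"name"`
	VolumeName     string    `json:"volumeName"`     // bound PV, if any
	StorageClass   string    `json:"storageClass"`
	CapacityBytes  int64     `json:"capacityBytes"`
	UsedBytes      int64     `json:"usedBytes"`
	AvailableBytes int64     `json:"availableBytes"`
	StatsAvailable bool      `json:"statsAvailable"` // false for block-backed PVCs
	Timestamp      time.Time `json:"timestamp"`
}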

Sequence Diagram(s)

sequenceDiagram
    participant Controller
    participant Collector as PVC Metrics Collector
    participant K8s as Kubernetes Informers
    participant Prometheus
    participant Batcher

    Controller->>Collector: Start(ctx)
    Collector->>K8s: Init informers & WaitForCacheSync
    K8s-->>Collector: Cache ready (PVCs)
    Collector->>Batcher: Start batcher
    Collector->>Collector: Launch collectMetricsLoop
    loop Periodic collection
        Collector->>K8s: List PVCs (namespace-scoped or all)
        K8s-->>Collector: PVC list
        Collector->>Collector: Filter exclusions / skip unbound
        loop Per-PVC
            alt filesystem-backed
                Collector->>Prometheus: Query used/capacity/available
                Prometheus-->>Collector: metrics or errors
            else block-backed
                Collector-->>Collector: mark stats unavailable
            end
            Collector->>Batcher: emitSnapshot(snapshot)
        end
        Batcher-->>Controller: Send batched snapshots -> output channel
    end
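
The periodic loop in the diagram could be implemented roughly as below; this is a minimal sketch assuming the stopCh, ticker, and collectAllPVCMetrics names from the walkthrough, not the repository's exact code:

func (c *PersistentVolumeClaimMetricsCollector) collectMetricsLoop(ctx context.Context) {
	for {
		select {
		case <-ctx.Done():
			return // parent context cancelled
		case <-c.stopCh:
			return // Stop() closed the stop channel
		case <-c.ticker.C:
			// List PVCs from the informer cache, apply exclusions,
			// query Prometheus per PVC, and hand snapshots to the batcher.
			c.collectAllPVCMetrics(ctx)
		}
	}
}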

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes


Suggested labels

enhancement

Suggested reviewers

  • dray92
  • sandipanpanda

Poem

🐰 I hopped in with metrics bright and keen,
I asked Prometheus what the bytes had seen.
I batched each PVC’s little storage song,
Sent snapshots hopping swiftly along. ✨📦

🚥 Pre-merge checks | ✅ 1 | ❌ 4
❌ Failed checks (2 warnings, 2 inconclusive)
Check name | Status | Explanation | Resolution
Out of Scope Changes check | ⚠️ Warning | The helm-chart/zxporter/local_values.yaml addition appears to be a configuration file unrelated to the core PVC metrics collector implementation and enum changes. | Remove the helm-chart/zxporter/local_values.yaml file or provide justification for its inclusion within the scope of PVC metrics collector functionality.
Docstring Coverage | ⚠️ Warning | Docstring coverage is 28.57%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check | ❓ Inconclusive | The title 'pvc metrics collector added' is vague and doesn't clearly convey the specific change or its scope. | Consider a more descriptive title such as 'Add PersistentVolumeClaim metrics collector'.
Linked Issues check | ❓ Inconclusive | The linked issue #123 'cluster snapshot enum name fix' appears unrelated to the primary changes involving PVC metrics collection, making compliance impossible to validate. | Clarify the relationship between issue #123 and this PR, or link the correct issue(s) that define the requirements for the PVC metrics collector implementation.
✅ Passed checks (1 passed)
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


coderabbitai (Bot) left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
internal/collector/interface.go (1)

207-309: Add ProtoType mapping for PersistentVolumeClaimMetrics.

Without a case in ProtoType(), this new resource type will serialize as RESOURCE_TYPE_UNSPECIFIED, so pvc-metrics resources will be mislabeled on the wire.

🔧 Proposed fix
 case PersistentVolume:
 	return gen.ResourceType_RESOURCE_TYPE_PERSISTENT_VOLUME
+case PersistentVolumeClaimMetrics:
+	return gen.ResourceType_RESOURCE_TYPE_PVC_METRICS // use the actual proto enum name
 case StorageClass:
 	return gen.ResourceType_RESOURCE_TYPE_STORAGE_CLASS
internal/controller/collectionpolicy_controller.go (1)

658-661: Config changes won’t restart pvc_metrics.

identifyAffectedCollectors updates PVC resource collectors, but doesn’t include pvc_metrics for PVC exclusion or Prometheus/update-interval changes, so metrics can keep stale config until a full restart.

🔧 Proposed fix
 if !reflect.DeepEqual(oldConfig.ExcludedPVCs, newConfig.ExcludedPVCs) {
 	affectedCollectors["persistent_volume_claim"] = true
+	affectedCollectors["pvc_metrics"] = true
 }

 if oldConfig.UpdateInterval != newConfig.UpdateInterval ||
 	oldConfig.PrometheusURL != newConfig.PrometheusURL ||
 	oldConfig.DisableNetworkIOMetrics != newConfig.DisableNetworkIOMetrics ||
 	oldConfig.DisableGPUMetrics != newConfig.DisableGPUMetrics {
 	affectedCollectors["node"] = true
 	affectedCollectors["container_resource"] = true
+	affectedCollectors["pvc_metrics"] = true
 }

Also applies to: 787-794

🤖 Fix all issues with AI agents
In `@internal/collector/pvc_metrics_collector.go`:
- Around line 176-188: Shutdown can race: collectAllPVCMetrics (from
collectMetricsLoop) may send to batchChan after Stop() closes it causing a
send-on-closed panic; change shutdown to coordinate goroutines so batchChan is
closed only after the collector loop exits. Modify
collectMetricsLoop/collectAllPVCMetrics coordination: add a sync.WaitGroup or a
dedicated done channel tracked by the collector (refer to collectMetricsLoop,
collectAllPVCMetrics, Stop, batchChan, stopCh, ticker), have Stop signal the
loop to exit (via stopCh or context) and then wait for the goroutine to finish
before closing batchChan and stopping the ticker; alternatively close batchChan
from inside the collector goroutine after the loop ends instead of from Stop.
- Around line 24-35: Update the doc comment for
PersistentVolumeClaimMetricsCollectorConfig.PrometheusURL to match the actual
default used by the constructor (the in-cluster URL
"http://prometheus-service.monitoring.svc.cluster.local:8080"); locate the
struct PersistentVolumeClaimMetricsCollectorConfig and replace the current
comment that says "http://prometheus.monitoring:9090" with the correct default
URL so the documentation matches the code path that sets the default in the
collector constructor that initializes PrometheusURL.
- Around line 681-697: IsAvailable on PersistentVolumeClaimMetricsCollector
currently returns false when prometheusAPI is nil causing a registration
deadlock; fix by lazily initializing prometheusAPI inside IsAvailable (or move
initialization into the constructor) before issuing the test Query: call the
same client creation routine used by Start (or a new helper like
initPrometheusClient), handle and log initialization errors, and ensure
concurrent safety by using a mutex or sync.Once around prometheusAPI assignment
so multiple callers don't race; after successful init proceed with the Query and
return true/false based on the Query result. (A mutex-guarded sketch follows this list.)
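
The concurrency-safe lazy initialization the last item describes might look like the sketch below. The abbreviated receiver type and the helper name initPrometheusClient are assumptions; api.NewClient and v1.NewAPI are the real prometheus/client_golang entry points:

import (
	"sync"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

// Minimal stand-in for the collector; the real struct has more fields.
type pvcMetricsCollector struct {
	mu            sync.Mutex
	prometheusURL string
	prometheusAPI v1.API
}

// initPrometheusClient builds the client under a mutex so concurrent
// IsAvailable/Start callers cannot race; unlike sync.Once, a failed
// attempt can be retried on the next call.
func (c *pvcMetricsCollector) initPrometheusClient() error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.prometheusAPI != nil {
		return nil
	}
	client, err := api.NewClient(api.Config{Address: c.prometheusURL})
	if err != nil {
		return err
	}
	c.prometheusAPI = v1.NewAPI(client)
	return nil
}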

Comment on lines +24 to +35
// PersistentVolumeClaimMetricsCollectorConfig holds configuration for the PVC metrics collector
type PersistentVolumeClaimMetricsCollectorConfig struct {
	// UpdateInterval specifies how often to collect PVC metrics
	UpdateInterval time.Duration

	// PrometheusURL specifies the URL of the Prometheus instance to query
	// If empty, defaults to in-cluster Prometheus at http://prometheus.monitoring:9090
	PrometheusURL string

	// QueryTimeout specifies the timeout for Prometheus queries
	QueryTimeout time.Duration
}

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, let's find and examine the pvc_metrics_collector.go file
find . -name "pvc_metrics_collector.go" -type f

Repository: devzero-inc/zxporter

Length of output: 109


🏁 Script executed:

# Read the file to verify the doc comment and constructor
cat -n ./internal/collector/pvc_metrics_collector.go

Repository: devzero-inc/zxporter

Length of output: 25479


Update doc comment to match the actual default Prometheus URL.

The config comment at line 30 states the default is http://prometheus.monitoring:9090, but the constructor at line 84 sets it to http://prometheus-service.monitoring.svc.cluster.local:8080. Update the comment to reflect the actual implementation.

Suggested fix
-// If empty, defaults to in-cluster Prometheus at http://prometheus.monitoring:9090
+// If empty, defaults to in-cluster Prometheus at http://prometheus-service.monitoring.svc.cluster.local:8080

Comment on lines +176 to +188
c.ticker = time.NewTicker(c.config.UpdateInterval)

go c.collectMetricsLoop(ctx)

stopCh := c.stopCh
go func() {
	select {
	case <-ctx.Done():
		c.Stop()
	case <-stopCh:
	}
}()


⚠️ Potential issue | 🟠 Major

Prevent send-on-closed panic during shutdown.

collectAllPVCMetrics can still emit snapshots while Stop() closes batchChan, which can panic. Coordinate shutdown (e.g., wait for the collection goroutine to finish) before closing the channel.

🔧 Suggested fix
 type PersistentVolumeClaimMetricsCollector struct {
 	// ...
 	mu              sync.RWMutex
+	wg              sync.WaitGroup
 }

-	go c.collectMetricsLoop(ctx)
+	c.wg.Add(1)
+	go func() {
+		defer c.wg.Done()
+		c.collectMetricsLoop(ctx)
+	}()

 	// ...
 	select {
 	case <-c.stopCh:
 		c.logger.Info("PVC metrics collector stop channel already closed")
 	default:
 		close(c.stopCh)
 		c.logger.Info("Closed PVC metrics collector stop channel")
 	}

-	if c.batchChan != nil {
+	c.wg.Wait()
+	if c.batchChan != nil {
 		close(c.batchChan)
 		c.batchChan = nil
 		c.logger.Info("Closed PVC metrics collector batch input channel")
 	}

Also applies to: 559-579, 642-668


Comment on lines +681 to +697
func (c *PersistentVolumeClaimMetricsCollector) IsAvailable(ctx context.Context) bool {
	if c.prometheusAPI == nil {
		c.logger.Info("Prometheus API not available for PVC metrics")
		return false
	}

	queryCtx, cancel := context.WithTimeout(ctx, 10*time.Second)
	defer cancel()

	_, _, err := c.prometheusAPI.Query(queryCtx, "up", time.Now())
	if err != nil {
		c.logger.Info("Prometheus API not responding", "error", err.Error())
		return false
	}

	return true
}

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# Find the file and understand its structure
find . -name "pvc_metrics_collector.go" -type f

Repository: devzero-inc/zxporter

Length of output: 184


🏁 Script executed:

# Search for the Start() method in the collector to see where prometheusAPI is initialized
rg -A 20 "func \(c \*PersistentVolumeClaimMetricsCollector\) Start" internal/collector/pvc_metrics_collector.go

Repository: devzero-inc/zxporter

Length of output: 727


🏁 Script executed:

# Search for where IsAvailable is called to understand the lifecycle
rg "IsAvailable" --type go -B 3 -A 3

Repository: devzero-inc/zxporter

Length of output: 29640


🏁 Script executed:

# Look for the NewPersistentVolumeClaimMetricsCollector constructor
rg -A 15 "func NewPersistentVolumeClaimMetricsCollector" internal/collector/pvc_metrics_collector.go

Repository: devzero-inc/zxporter

Length of output: 614


🏁 Script executed:

# Get more context from the controller file about the registration flow
rg -A 30 "if c.collector.IsAvailable" internal/controller/collectionpolicy_controller.go

Repository: devzero-inc/zxporter

Length of output: 1203


🏁 Script executed:

# Look for when Start() is called on collectors
rg "\.Start\(" internal/controller/collectionpolicy_controller.go -B 5 -A 5

Repository: devzero-inc/zxporter

Length of output: 1287


🏁 Script executed:

# Check the complete collector registration and initialization sequence
rg -A 50 "RegisterCollector" internal/collector/interface.go

Repository: devzero-inc/zxporter

Length of output: 46


🏁 Script executed:

# Look for where collectors are created and when Start is called in the reconciliation
rg -B 5 -A 15 "func.*Reconcile" internal/controller/collectionpolicy_controller.go | head -80

Repository: devzero-inc/zxporter

Length of output: 3330


🏁 Script executed:

# Search for RegisterCollector implementation to see if it calls Start()
rg -A 30 "func.*RegisterCollector" internal/

Repository: devzero-inc/zxporter

Length of output: 1965


🏁 Script executed:

# Check if there's any other place prometheusAPI is initialized
rg "prometheusAPI =" internal/collector/pvc_metrics_collector.go -B 2 -A 2

Repository: devzero-inc/zxporter

Length of output: 519


🏁 Script executed:

# Look for any initialization of prometheusAPI before Start()
rg "prometheusAPI" internal/collector/pvc_metrics_collector.go | head -20

Repository: devzero-inc/zxporter

Length of output: 475


🏁 Script executed:

# Get full context of the PVC metrics collector constructor and fields
rg -B 5 -A 40 "type PersistentVolumeClaimMetricsCollector struct" internal/collector/pvc_metrics_collector.go

Repository: devzero-inc/zxporter

Length of output: 1737


🏁 Script executed:

# Search for where Start() is called on registered collectors
rg "\.Start\(" internal/collector/manager.go -B 5 -A 10

Repository: devzero-inc/zxporter

Length of output: 620


🏁 Script executed:

# Look for the full reconciliation flow to see when collectors are started
rg -B 10 -A 20 "registerResourceCollectors" internal/controller/collectionpolicy_controller.go | head -100

Repository: devzero-inc/zxporter

Length of output: 2905


Lazy-initialize Prometheus API client in IsAvailable().

The collector cannot be registered because prometheusAPI is only initialized in Start(), but IsAvailable() is called before Start(). When IsAvailable() returns false due to prometheusAPI being nil, the collector is never registered, so Start() is never called, creating a deadlock where the collector can never run.

Initialize the Prometheus client in IsAvailable() if not already initialized (or move initialization to the constructor) to break this cycle.

🔧 Proposed fix (lazy init)
 func (c *PersistentVolumeClaimMetricsCollector) IsAvailable(ctx context.Context) bool {
 	if c.prometheusAPI == nil {
-		c.logger.Info("Prometheus API not available for PVC metrics")
-		return false
+		httpClient := NewPrometheusClient(c.metrics)
+		client, err := api.NewClient(api.Config{
+			Address: c.config.PrometheusURL,
+			Client:  httpClient,
+		})
+		if err != nil {
+			c.logger.Info("Prometheus API not available for PVC metrics", "error", err.Error())
+			return false
+		}
+		c.prometheusAPI = v1.NewAPI(client)
 	}

coderabbitai (Bot) left a comment

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@internal/collector/pvc_metrics_collector.go`:
- Around line 134-143: In PVCMetricsCollector's Start (or the block where
Prometheus client creation fails) avoid calling c.telemetryLogger.Report
unconditionally; guard the call with a nil check (if c.telemetryLogger != nil {
c.telemetryLogger.Report(...) }) or fallback to an existing non-telemetry logger
(e.g., c.logger) so a nil telemetryLogger won't cause a panic when Prometheus
client creation returns an error; ensure the function still returns the original
error after logging.
- Around line 165-170: The informer factory is started via
c.informerFactory.Start(c.stopCh) but if cache.WaitForCacheSync(...) returns
false you currently return an error without stopping informers; before
returning, close the stop channel (or call the factory stop if available) to
signal goroutine shutdown. Update the error path in the block that calls
cache.WaitForCacheSync(c.stopCh, c.pvcInformer.HasSynced,
c.pvInformer.HasSynced) to close c.stopCh (or invoke the informer factory stop)
and then return the fmt.Errorf("timed out waiting for caches to sync") to
prevent informer goroutine leaks.
- Around line 504-553: The Prometheus queries in the PVC metrics collector
currently pick vector[0] and can return multiple series; change the query
strings in the queries map to use max aggregation by PVC (e.g. `max by
(namespace, persistentvolumeclaim)(kubelet_volume_stats_used_bytes{...})`) for
"used", "capacity" and "available", then in the code that calls
c.prometheusAPI.Query (the blocks that inspect result.Type() == model.ValVector
and convert to model.Vector) enforce that the returned vector has exactly one
sample (len(vector) == 1) and return a clear error if it has 0 or >1 samples;
also treat non-vector result types as errors rather than silently ignoring them
(affects prometheusAPI.Query handling and population of filesystemUsage).
♻️ Duplicate comments (3)
internal/collector/pvc_metrics_collector.go (3)

29-31: Doc comment default URL is out of sync with constructor.

The comment still references http://prometheus.monitoring:9090, but the constructor defaults to http://prometheus-service.monitoring.svc.cluster.local:9090. Keeping these aligned avoids confusion for operators.

💡 Suggested fix
-// If empty, defaults to in-cluster Prometheus at http://prometheus.monitoring:9090
+// If empty, defaults to in-cluster Prometheus at http://prometheus-service.monitoring.svc.cluster.local:9090

642-663: Avoid send-on-closed panic during shutdown.

collectAllPVCMetrics can still emit snapshots while Stop() closes batchChan, which can panic. Coordinate shutdown (e.g., wait for the collection goroutine to exit) before closing the channel.

🔧 Suggested fix
 type PersistentVolumeClaimMetricsCollector struct {
 	// ...
 	mu              sync.RWMutex
+	wg              sync.WaitGroup
 }
-	go c.collectMetricsLoop(ctx)
+	c.wg.Add(1)
+	go func() {
+		defer c.wg.Done()
+		c.collectMetricsLoop(ctx)
+	}()
-	if c.batchChan != nil {
+	c.wg.Wait()
+	if c.batchChan != nil {
 		close(c.batchChan)
 		c.batchChan = nil
 		c.logger.Info("Closed PVC metrics collector batch input channel")
 	}

681-685: Initialize Prometheus API before availability check.

IsAvailable() returns false when prometheusAPI is nil, but prometheusAPI is only initialized in Start(). If IsAvailable() is used to decide registration, the collector can never be registered. Lazy-init in IsAvailable() (or move initialization into the constructor) and guard with sync.Once/mutex to avoid races.

🔧 Suggested fix (lazy init)
 func (c *PersistentVolumeClaimMetricsCollector) IsAvailable(ctx context.Context) bool {
 	if c.prometheusAPI == nil {
-		c.logger.Info("Prometheus API not available for PVC metrics")
-		return false
+		httpClient := NewPrometheusClient(c.metrics)
+		client, err := api.NewClient(api.Config{
+			Address: c.config.PrometheusURL,
+			Client:  httpClient,
+		})
+		if err != nil {
+			c.logger.Info("Prometheus API not available for PVC metrics", "error", err.Error())
+			return false
+		}
+		c.prometheusAPI = v1.NewAPI(client)
 	}
#!/bin/bash
# Verify where IsAvailable() and Start() are called to confirm registration order.
rg -n "IsAvailable\(" internal -C3
rg -n "\.Start\(" internal -C3

Comment on lines +134 to +143
c.telemetryLogger.Report(
	gen.LogLevel_LOG_LEVEL_ERROR,
	"PVCMetricsCollector",
	"Failed to create Prometheus client",
	err,
	map[string]string{
		"prometheus_url":   c.config.PrometheusURL,
		"zxporter_version": version.Get().String(),
	},
)

⚠️ Potential issue | 🟠 Major

Guard telemetryLogger in the error path to avoid a nil panic.

telemetryLogger is optional elsewhere, but here it’s dereferenced unconditionally. If it’s nil and Prometheus client creation fails, Start() will panic instead of returning the error.

🛠️ Suggested fix
-	c.telemetryLogger.Report(
-		gen.LogLevel_LOG_LEVEL_ERROR,
-		"PVCMetricsCollector",
-		"Failed to create Prometheus client",
-		err,
-		map[string]string{
-			"prometheus_url":   c.config.PrometheusURL,
-			"zxporter_version": version.Get().String(),
-		},
-	)
+	if c.telemetryLogger != nil {
+		c.telemetryLogger.Report(
+			gen.LogLevel_LOG_LEVEL_ERROR,
+			"PVCMetricsCollector",
+			"Failed to create Prometheus client",
+			err,
+			map[string]string{
+				"prometheus_url":   c.config.PrometheusURL,
+				"zxporter_version": version.Get().String(),
+			},
+		)
+	}

Comment on lines +165 to +170
c.informerFactory.Start(c.stopCh)

c.logger.Info("Waiting for informer caches to sync")
if !cache.WaitForCacheSync(c.stopCh, c.pvcInformer.HasSynced, c.pvInformer.HasSynced) {
	return fmt.Errorf("timed out waiting for caches to sync")
}

⚠️ Potential issue | 🟡 Minor

Clean up informers when cache sync fails.

If WaitForCacheSync fails, the informer factory has already started and will keep goroutines running. Consider closing stopCh (or calling Stop()) before returning the error to avoid leaks.

🛠️ Suggested fix
	if !cache.WaitForCacheSync(c.stopCh, c.pvcInformer.HasSynced, c.pvInformer.HasSynced) {
+		select {
+		case <-c.stopCh:
+		default:
+			close(c.stopCh)
+		}
 		return fmt.Errorf("timed out waiting for caches to sync")
 	}

Comment on lines +504 to +553
// Prometheus queries for PVC volume stats
queries := map[string]string{
	"used":      fmt.Sprintf(`kubelet_volume_stats_used_bytes{namespace="%s", persistentvolumeclaim="%s"}`, pvc.Namespace, pvc.Name),
	"capacity":  fmt.Sprintf(`kubelet_volume_stats_capacity_bytes{namespace="%s", persistentvolumeclaim="%s"}`, pvc.Namespace, pvc.Name),
	"available": fmt.Sprintf(`kubelet_volume_stats_available_bytes{namespace="%s", persistentvolumeclaim="%s"}`, pvc.Namespace, pvc.Name),
}

usage := &filesystemUsage{}
queryTime := time.Now()

result, _, err := c.prometheusAPI.Query(ctx, queries["used"], queryTime)
if err != nil {
	return nil, fmt.Errorf("failed to query used bytes: %w", err)
}

if result.Type() == model.ValVector {
	vector := result.(model.Vector)
	if len(vector) > 0 {
		usage.UsedBytes = int64(vector[0].Value)
	} else {
		return nil, fmt.Errorf("no used bytes metric found")
	}
}

result, _, err = c.prometheusAPI.Query(ctx, queries["capacity"], queryTime)
if err != nil {
	return nil, fmt.Errorf("failed to query capacity bytes: %w", err)
}

if result.Type() == model.ValVector {
	vector := result.(model.Vector)
	if len(vector) > 0 {
		usage.CapacityBytes = int64(vector[0].Value)
	} else {
		return nil, fmt.Errorf("no capacity bytes metric found")
	}
}

result, _, err = c.prometheusAPI.Query(ctx, queries["available"], queryTime)
if err != nil {
	return nil, fmt.Errorf("failed to query available bytes: %w", err)
}

if result.Type() == model.ValVector {
	vector := result.(model.Vector)
	if len(vector) > 0 {
		usage.AvailableBytes = int64(vector[0].Value)
	} else {
		return nil, fmt.Errorf("no available bytes metric found")
	}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

kubelet_volume_stats metrics multiple series per PVC Prometheus aggregation

💡 Result:

kubelet_volume_stats_* are emitted by each kubelet (i.e., per scrape target / node). So you can legitimately see multiple time series for the same (namespace, persistentvolumeclaim) when the same PVC is visible from more than one kubelet scrape target (e.g., multi-attach volumes, RWX volumes mounted on multiple nodes, or duplicate scraping / relabeling that creates multiple targets for the same node). The “canonical” identifying labels for these metrics are namespace + persistentvolumeclaim (and then you’ll typically also have target labels like instance, node, job, etc.). [1], [2]

PromQL: de-duplicate / collapse to 1 series per PVC

In most setups, each kubelet reports the same bytes values for a given PVC, so you usually want max (not sum) across the per-node/target duplicates:

max by (namespace, persistentvolumeclaim) (kubelet_volume_stats_used_bytes)
max by (namespace, persistentvolumeclaim) (kubelet_volume_stats_capacity_bytes)
max by (namespace, persistentvolumeclaim) (kubelet_volume_stats_available_bytes)

PVC usage percent (1 series per PVC):

100 *
max by (namespace, persistentvolumeclaim) (kubelet_volume_stats_used_bytes)
/
max by (namespace, persistentvolumeclaim) (kubelet_volume_stats_capacity_bytes)

(You’ll often see examples using sum by (...) for per-PVC rollups; if you have duplicates, prefer max to avoid overcounting.) [3]

If your “namespace” label looks wrong (common kube-prometheus-stack gotcha)

If you scrape kubelet with honorLabels: true, Prometheus can rename clashing labels to exported_namespace, etc. In that case you may need to filter/aggregate on exported_namespace (or disable honorLabels for that scrape). [4]

Sources:
[1] Kubernetes docs: Volume stats metrics use namespace and persistentvolumeclaim labels (example: health status metric) https://kubernetes.io/docs/concepts/storage/volume-health-monitoring/
[2] GKE kubelet metrics reference showing kubelet_volume_stats_* label set includes namespace and persistentvolumeclaim https://cloud.google.com/kubernetes-engine/docs/how-to/cadvisor-kubelet-metrics
[3] StackOverflow examples of aggregating kubelet_volume_stats_* by (namespace, persistentvolumeclaim) https://stackoverflow.com/questions/44718268/how-to-monitor-disk-usage-of-kubernetes-persistent-volumes
[4] Discussion of exported_namespace / honorLabels causing misleading PVC labels https://www.reddit.com/r/PrometheusMonitoring/comments/17l21h3/information_about_kubernetes_pvcs_are_wrong/


🏁 Script executed:

# Examine the code file to understand the context better
head -n 560 internal/collector/pvc_metrics_collector.go | tail -n 70

Repository: devzero-inc/zxporter

Length of output: 2378


🏁 Script executed:

# Check if there are tests or examples in the codebase
fd -e go pvc_metrics | head -20

Repository: devzero-inc/zxporter

Length of output: 141


🏁 Script executed:

# Look for any existing aggregation or handling patterns in the codebase
rg "sum by|topk|max|min" -t go | grep -i pvc | head -20

Repository: devzero-inc/zxporter

Length of output: 543


Use max aggregation for PVC metrics; enforce single-sample validation to avoid incorrect results from arbitrary sample selection.

These metrics are emitted per kubelet instance, producing multiple series for the same PVC when mounted on multiple nodes (common for RWX volumes). Taking vector[0] selects arbitrary data, and non-vector results are silently ignored, leaving zero values. Aggregate with max by (namespace, persistentvolumeclaim) (not sum) and enforce exactly one sample per query.

🛠️ Suggested fix
-	queries := map[string]string{
-		"used":      fmt.Sprintf(`kubelet_volume_stats_used_bytes{namespace="%s", persistentvolumeclaim="%s"}`, pvc.Namespace, pvc.Name),
-		"capacity":  fmt.Sprintf(`kubelet_volume_stats_capacity_bytes{namespace="%s", persistentvolumeclaim="%s"}`, pvc.Namespace, pvc.Name),
-		"available": fmt.Sprintf(`kubelet_volume_stats_available_bytes{namespace="%s", persistentvolumeclaim="%s"}`, pvc.Namespace, pvc.Name),
-	}
+	queries := map[string]string{
+		"used":      fmt.Sprintf(`max by (namespace, persistentvolumeclaim) (kubelet_volume_stats_used_bytes{namespace="%s", persistentvolumeclaim="%s"})`, pvc.Namespace, pvc.Name),
+		"capacity":  fmt.Sprintf(`max by (namespace, persistentvolumeclaim) (kubelet_volume_stats_capacity_bytes{namespace="%s", persistentvolumeclaim="%s"})`, pvc.Namespace, pvc.Name),
+		"available": fmt.Sprintf(`max by (namespace, persistentvolumeclaim) (kubelet_volume_stats_available_bytes{namespace="%s", persistentvolumeclaim="%s"})`, pvc.Namespace, pvc.Name),
+	}
-	if result.Type() == model.ValVector {
-		vector := result.(model.Vector)
-		if len(vector) > 0 {
-			usage.UsedBytes = int64(vector[0].Value)
-		} else {
-			return nil, fmt.Errorf("no used bytes metric found")
-		}
-	}
+	vector, ok := result.(model.Vector)
+	if !ok {
+		return nil, fmt.Errorf("unexpected result type for used bytes: %s", result.Type())
+	}
+	if len(vector) != 1 {
+		return nil, fmt.Errorf("expected 1 sample for used bytes, got %d", len(vector))
+	}
+	usage.UsedBytes = int64(vector[0].Value)

coderabbitai (Bot) left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
internal/collector/interface.go (1)

90-111: Do not insert new constants mid-iota; append to preserve existing numeric values.

Inserting PersistentVolumeClaimMetrics between PersistentVolume and StorageClass renumbers all subsequent constants, breaking compatibility if these integers are serialized or stored externally. The codebase explicitly converts ResourceType to protobuf via .ProtoType() and transmits values over gRPC (internal/transport/dakr_client.go), confirming that numeric values matter. The existing comments "leaving here to not screw up enum numbering" on later constants show this concern is known. Move PersistentVolumeClaimMetrics to the end of the enum instead.
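
Illustratively, appending keeps every existing constant's integer stable; this is a generic sketch, not the actual enum from interface.go:

type ResourceType int

const (
	Pod ResourceType = iota // existing entries keep their numeric values
	PersistentVolume
	StorageClass
	// ... remaining existing entries ...

	// New types are appended so serialized integers stay wire-compatible.
	PersistentVolumeClaimMetrics
)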

🤖 Fix all issues with AI agents
In `@internal/collector/pvc_metrics_collector.go`:
- Around line 681-694: The IsAvailable method on
PersistentVolumeClaimMetricsCollector incorrectly always returns true; change it
to return false when the Prometheus API is unavailable or the query fails: in
PersistentVolumeClaimMetricsCollector.IsAvailable, when c.prometheusAPI == nil
log the condition and return false, and after the Query call if err != nil log
the error and return false; otherwise return true. Ensure you reference the
prometheusAPI field and the IsAvailable method so callers in
collectionpolicy_controller.go will correctly skip unavailable collectors.

Comment on lines +681 to +694
func (c *PersistentVolumeClaimMetricsCollector) IsAvailable(ctx context.Context) bool {
	if c.prometheusAPI == nil {
		c.logger.Info("Prometheus API not available for PVC metrics")
		return true
	}

	queryCtx, cancel := context.WithTimeout(ctx, 10*time.Second)
	defer cancel()

	_, _, err := c.prometheusAPI.Query(queryCtx, "up", time.Now())
	if err != nil {
		c.logger.Info("Prometheus API not responding", "error", err.Error())
		return true
	}

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

cd /root && find . -type f -name "pvc_metrics_collector.go" | head -5

Repository: devzero-inc/zxporter

Length of output: 119


🏁 Script executed:

find . -type f -name "pvc_metrics_collector.go" -exec wc -l {} \;

Repository: devzero-inc/zxporter

Length of output: 113


🏁 Script executed:

find . -type f -name "pvc_metrics_collector.go" -exec cat -n {} \; | sed -n '675,700p'

Repository: devzero-inc/zxporter

Length of output: 946


🏁 Script executed:

rg "IsAvailable" --type go -B 2 -A 2 | grep -A 3 -B 3 "pvc_metrics_collector\|PersistentVolumeClaimMetricsCollector"

Repository: devzero-inc/zxporter

Length of output: 715


🏁 Script executed:

rg "\.IsAvailable\(" --type go

Repository: devzero-inc/zxporter

Length of output: 261


🏁 Script executed:

cat -n internal/controller/collectionpolicy_controller.go | grep -B 5 -A 5 "IsAvailable"

Repository: devzero-inc/zxporter

Length of output: 960


IsAvailable currently always returns true, preventing availability gating from working.

The method logs error conditions at lines 683 and 692 but still returns true, which contradicts the intent. Callers at line 1494 and 2659 of collectionpolicy_controller.go rely on IsAvailable returning false to skip unavailable collectors. By always returning true, unavailable collectors (when Prometheus API is nil or unresponsive) still get registered, defeating the availability check.

🛠️ Suggested fix
 	if c.prometheusAPI == nil {
 		c.logger.Info("Prometheus API not available for PVC metrics")
-		return true
+		return false
 	}
@@
 	if err != nil {
 		c.logger.Info("Prometheus API not responding", "error", err.Error())
-		return true
+		return false
 	}

coderabbitai (Bot) left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
internal/controller/collectionpolicy_controller.go (2)

693-695: Missing pvc_metrics in affected collectors when ExcludedPVCs changes.

The pvc_metrics collector constructor takes ExcludedPVCs as a parameter (line 2334), but when ExcludedPVCs changes, only "persistent_volume_claim" is marked as affected. The pvc_metrics collector will not be restarted and will continue using stale exclusion rules.

🐛 Proposed fix
 if !reflect.DeepEqual(oldConfig.ExcludedPVCs, newConfig.ExcludedPVCs) {
     affectedCollectors["persistent_volume_claim"] = true
+    affectedCollectors["pvc_metrics"] = true
 }

831-837: Missing pvc_metrics when Prometheus configuration changes.

The pvc_metrics collector queries Prometheus (via PrometheusURL) and uses UpdateInterval, but it's not included in the affected collectors when these values change. This means configuration changes won't take effect until a full restart.

🐛 Proposed fix
 if oldConfig.UpdateInterval != newConfig.UpdateInterval ||
     oldConfig.PrometheusURL != newConfig.PrometheusURL ||
     oldConfig.DisableNetworkIOMetrics != newConfig.DisableNetworkIOMetrics ||
     oldConfig.DisableGPUMetrics != newConfig.DisableGPUMetrics {
     affectedCollectors["node"] = true
     affectedCollectors["container_resource"] = true
+    affectedCollectors["pvc_metrics"] = true
 }
🤖 Fix all issues with AI agents
In `@helm-chart/zxporter/local_values.yaml`:
- Around line 31-36: Remove the hard-coded personal access token by clearing or
replacing the patToken value in the Helm values so no real secret is committed;
locate the patToken entry and set it to an empty string or a non-secret
placeholder (e.g., "") and add a comment directing operators to supply the real
PAT via a secret or external secret management at deploy time (reference:
patToken).

Comment thread on helm-chart/zxporter/local_values.yaml (outdated)
f-leu added the changelog:added (New features) label on Jan 28, 2026
Rupam-It merged commit 0cdb3c1 into main on Jan 28, 2026
26 checks passed
Rupam-It deleted the rupam/zxporter—PVCStorageUsageMetricsCollector branch on January 28, 2026 at 10:45
Parthiba-Hazra pushed a commit that referenced this pull request May 5, 2026
* pvc metrics collector added

* endpoint fixed

* make genrate-proto fix every thing

* fix lint error

* local deployment values deleted

---------

Labels

changelog:added New features


3 participants