
pvc metrics collector added #252

Merged
Rupam-It merged 7 commits into main from rupam/zxporter—PVCStorageUsageMetricsCollector on Jan 28, 2026


Conversation

Rupam-It (Contributor) commented Jan 23, 2026

pvc metrics collector added

📚 Description of Changes

In this PR, I extend the core functionality of the metrics collector by adding a new PVC metrics collector. zxporter can now send PVC metrics alongside container metrics!

  • What Changed:
    Added a new PVC collector that gathers storage usage metrics for PersistentVolumeClaims.

  • Why This Change:
    This makes the collector framework extensible to PVC metrics alongside the existing resource metrics.

coderabbitai (Bot) commented Jan 23, 2026

📝 Walkthrough


Adds a new PersistentVolumeClaimMetrics resource type, a strongly-typed PVC metrics snapshot, a Prometheus-backed PVC metrics collector with informer-driven listing and batching, controller wiring to register/start/stop the collector, and a local Helm values file.

Changes

Cohort / File(s) Summary
Resource Type Definition
internal/collector/interface.go
Added PersistentVolumeClaimMetrics to ResourceType, mapped to "pvc_metrics", and updated ProtoType() handling.
Data Model
internal/collector/pvc_metrics.go
Added PersistentVolumeClaimMetricsSnapshot struct (PVC identity, binding/storage metadata, usage metrics, stats availability; JSON-tagged). A rough sketch follows the table.
Collector Implementation
internal/collector/pvc_metrics_collector.go
New PersistentVolumeClaimMetricsCollector and PersistentVolumeClaimMetricsCollectorConfig; constructor, Start/Stop, collection loop, Prometheus queries for filesystem usage, per-PVC processing, batching/emission, helper functions, exclusions, and availability checks.
Controller Integration
internal/controller/collectionpolicy_controller.go
Register, construct, start/stop, and dynamically enable/disable the PersistentVolumeClaimMetrics collector; integrated into restart and dynamic-enable flows.
Formatting Only
internal/collector/pvc_collector.go
Minor composite-literal formatting change; no behavioral change.
Helm Values
helm-chart/zxporter/local_values.yaml
Added local Helm values file with ZXPorter defaults for deployment, monitoring, tokens, scheduling, and HA settings.
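
As a rough illustration of the "Data Model" row above, the snapshot might look like the following Go struct; the field names here are illustrative assumptions, not the actual definitions in internal/collector/pvc_metrics.go:

type PersistentVolumeClaimMetricsSnapshot struct {
	Namespace      string    `json:"namespace"`
	Name           string    `json:"name"`
	VolumeName     string    `json:"volumeName"`     // bound PV, if any
	StorageClass   string    `json:"storageClass"`
	CapacityBytes  int64     `json:"capacityBytes"`
	UsedBytes      int64     `json:"usedBytes"`
	AvailableBytes int64     `json:"availableBytes"`
	StatsAvailable bool      `json:"statsAvailable"` // false for block-backed PVCs
	Timestamp      time.Time `json:"timestamp"`
}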

Sequence Diagram(s)

sequenceDiagram
    participant Controller
    participant Collector as PVC Metrics Collector
    participant K8s as Kubernetes Informers
    participant Prometheus
    participant Batcher

    Controller->>Collector: Start(ctx)
    Collector->>K8s: Init informers & WaitForCacheSync
    K8s-->>Collector: Cache ready (PVCs)
    Collector->>Batcher: Start batcher
    Collector->>Collector: Launch collectMetricsLoop
    loop Periodic collection
        Collector->>K8s: List PVCs (namespace-scoped or all)
        K8s-->>Collector: PVC list
        Collector->>Collector: Filter exclusions / skip unbound
        loop Per-PVC
            alt filesystem-backed
                Collector->>Prometheus: Query used/capacity/available
                Prometheus-->>Collector: metrics or errors
            else block-backed
                Collector-->>Collector: mark stats unavailable
            end
            Collector->>Batcher: emitSnapshot(snapshot)
        end
        Batcher-->>Controller: Send batched snapshots -> output channel
    end
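
The periodic loop in the diagram could be implemented roughly as below; this is a minimal sketch assuming the stopCh, ticker, and collectAllPVCMetrics names from the walkthrough, not the repository's exact code:

func (c *PersistentVolumeClaimMetricsCollector) collectMetricsLoop(ctx context.Context) {
	for {
		select {
		case <-ctx.Done():
			return // parent context cancelled
		case <-c.stopCh:
			return // Stop() closed the stop channel
		case <-c.ticker.C:
			// List PVCs from the informer cache, apply exclusions,
			// query Prometheus per PVC, and hand snapshots to the batcher.
			c.collectAllPVCMetrics(ctx)
		}
	}
}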

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes


Suggested labels

enhancement

Suggested reviewers

  • dray92
  • sandipanpanda

Poem

🐰 I hopped in with metrics bright and keen,
I asked Prometheus what the bytes had seen.
I batched each PVC’s little storage song,
Sent snapshots hopping swiftly along. ✨📦

🚥 Pre-merge checks | ✅ 1 | ❌ 4
❌ Failed checks (2 warnings, 2 inconclusive)
Check name | Status | Explanation | Resolution
Out of Scope Changes check | ⚠️ Warning | The helm-chart/zxporter/local_values.yaml addition appears to be a configuration file unrelated to the core PVC metrics collector implementation and enum changes. | Remove the helm-chart/zxporter/local_values.yaml file or provide justification for its inclusion within the scope of PVC metrics collector functionality.
Docstring Coverage | ⚠️ Warning | Docstring coverage is 28.57%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check | ❓ Inconclusive | The title 'pvc metrics collector added' is vague and doesn't clearly convey the specific change or its scope. | Consider a more descriptive title such as 'Add PersistentVolumeClaim metrics collector'.
Linked Issues check | ❓ Inconclusive | The linked issue #123 'cluster snapshot enum name fix' appears unrelated to the primary changes involving PVC metrics collection, making compliance impossible to validate. | Clarify the relationship between issue #123 and this PR, or link the correct issue(s) that define the requirements for the PVC metrics collector implementation.
✅ Passed checks (1 passed)
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


coderabbitai (Bot) left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
internal/collector/interface.go (1)

207-309: Add ProtoType mapping for PersistentVolumeClaimMetrics.

Without a case in ProtoType(), this new resource type will serialize as RESOURCE_TYPE_UNSPECIFIED, so pvc-metrics resources will be mislabeled on the wire.

🔧 Proposed fix
 case PersistentVolume:
 	return gen.ResourceType_RESOURCE_TYPE_PERSISTENT_VOLUME
+case PersistentVolumeClaimMetrics:
+	return gen.ResourceType_RESOURCE_TYPE_PVC_METRICS // use the actual proto enum name
 case StorageClass:
 	return gen.ResourceType_RESOURCE_TYPE_STORAGE_CLASS
internal/controller/collectionpolicy_controller.go (1)

658-661: Config changes won’t restart pvc_metrics.

identifyAffectedCollectors updates PVC resource collectors, but doesn’t include pvc_metrics for PVC exclusion or Prometheus/update-interval changes, so metrics can keep stale config until a full restart.

🔧 Proposed fix
 if !reflect.DeepEqual(oldConfig.ExcludedPVCs, newConfig.ExcludedPVCs) {
 	affectedCollectors["persistent_volume_claim"] = true
+	affectedCollectors["pvc_metrics"] = true
 }

 if oldConfig.UpdateInterval != newConfig.UpdateInterval ||
 	oldConfig.PrometheusURL != newConfig.PrometheusURL ||
 	oldConfig.DisableNetworkIOMetrics != newConfig.DisableNetworkIOMetrics ||
 	oldConfig.DisableGPUMetrics != newConfig.DisableGPUMetrics {
 	affectedCollectors["node"] = true
 	affectedCollectors["container_resource"] = true
+	affectedCollectors["pvc_metrics"] = true
 }

Also applies to: 787-794

🤖 Fix all issues with AI agents
In `@internal/collector/pvc_metrics_collector.go`:
- Around line 176-188: Shutdown can race: collectAllPVCMetrics (from
collectMetricsLoop) may send to batchChan after Stop() closes it causing a
send-on-closed panic; change shutdown to coordinate goroutines so batchChan is
closed only after the collector loop exits. Modify
collectMetricsLoop/collectAllPVCMetrics coordination: add a sync.WaitGroup or a
dedicated done channel tracked by the collector (refer to collectMetricsLoop,
collectAllPVCMetrics, Stop, batchChan, stopCh, ticker), have Stop signal the
loop to exit (via stopCh or context) and then wait for the goroutine to finish
before closing batchChan and stopping the ticker; alternatively close batchChan
from inside the collector goroutine after the loop ends instead of from Stop.
- Around line 24-35: Update the doc comment for
PersistentVolumeClaimMetricsCollectorConfig.PrometheusURL to match the actual
default used by the constructor (the in-cluster URL
"http://prometheus-service.monitoring.svc.cluster.local:8080"); locate the
struct PersistentVolumeClaimMetricsCollectorConfig and replace the current
comment that says "http://prometheus.monitoring:9090" with the correct default
URL so the documentation matches the code path that sets the default in the
collector constructor that initializes PrometheusURL.
- Around line 681-697: IsAvailable on PersistentVolumeClaimMetricsCollector
currently returns false when prometheusAPI is nil causing a registration
deadlock; fix by lazily initializing prometheusAPI inside IsAvailable (or move
initialization into the constructor) before issuing the test Query: call the
same client creation routine used by Start (or a new helper like
initPrometheusClient), handle and log initialization errors, and ensure
concurrent safety by using a mutex or sync.Once around prometheusAPI assignment
so multiple callers don't race; after successful init proceed with the Query and
return true/false based on the Query result. (A mutex-guarded sketch follows this list.)
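
The concurrency-safe lazy initialization the last item describes might look like the sketch below. The abbreviated receiver type and the helper name initPrometheusClient are assumptions; api.NewClient and v1.NewAPI are the real prometheus/client_golang entry points:

import (
	"sync"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

// Minimal stand-in for the collector; the real struct has more fields.
type pvcMetricsCollector struct {
	mu            sync.Mutex
	prometheusURL string
	prometheusAPI v1.API
}

// initPrometheusClient builds the client under a mutex so concurrent
// IsAvailable/Start callers cannot race; unlike sync.Once, a failed
// attempt can be retried on the next call.
func (c *pvcMetricsCollector) initPrometheusClient() error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.prometheusAPI != nil {
		return nil
	}
	client, err := api.NewClient(api.Config{Address: c.prometheusURL})
	if err != nil {
		return err
	}
	c.prometheusAPI = v1.NewAPI(client)
	return nil
}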

Comment on lines +24 to +35
// PersistentVolumeClaimMetricsCollectorConfig holds configuration for the PVC metrics collector
type PersistentVolumeClaimMetricsCollectorConfig struct {
	// UpdateInterval specifies how often to collect PVC metrics
	UpdateInterval time.Duration

	// PrometheusURL specifies the URL of the Prometheus instance to query
	// If empty, defaults to in-cluster Prometheus at http://prometheus.monitoring:9090
	PrometheusURL string

	// QueryTimeout specifies the timeout for Prometheus queries
	QueryTimeout time.Duration
}

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, let's find and examine the pvc_metrics_collector.go file
find . -name "pvc_metrics_collector.go" -type f

Repository: devzero-inc/zxporter

Length of output: 109


🏁 Script executed:

# Read the file to verify the doc comment and constructor
cat -n ./internal/collector/pvc_metrics_collector.go

Repository: devzero-inc/zxporter

Length of output: 25479


Update doc comment to match the actual default Prometheus URL.

The config comment at line 30 states the default is http://prometheus.monitoring:9090, but the constructor at line 84 sets it to http://prometheus-service.monitoring.svc.cluster.local:8080. Update the comment to reflect the actual implementation.

Suggested fix
-// If empty, defaults to in-cluster Prometheus at http://prometheus.monitoring:9090
+// If empty, defaults to in-cluster Prometheus at http://prometheus-service.monitoring.svc.cluster.local:8080

Comment on lines +176 to +188
c.ticker = time.NewTicker(c.config.UpdateInterval)

go c.collectMetricsLoop(ctx)

stopCh := c.stopCh
go func() {
	select {
	case <-ctx.Done():
		c.Stop()
	case <-stopCh:
	}
}()


⚠️ Potential issue | 🟠 Major

Prevent send-on-closed panic during shutdown.

collectAllPVCMetrics can still emit snapshots while Stop() closes batchChan, which can panic. Coordinate shutdown (e.g., wait for the collection goroutine to finish) before closing the channel.

🔧 Suggested fix
 type PersistentVolumeClaimMetricsCollector struct {
 	// ...
 	mu              sync.RWMutex
+	wg              sync.WaitGroup
 }

-	go c.collectMetricsLoop(ctx)
+	c.wg.Add(1)
+	go func() {
+		defer c.wg.Done()
+		c.collectMetricsLoop(ctx)
+	}()

 	// ...
 	select {
 	case <-c.stopCh:
 		c.logger.Info("PVC metrics collector stop channel already closed")
 	default:
 		close(c.stopCh)
 		c.logger.Info("Closed PVC metrics collector stop channel")
 	}

-	if c.batchChan != nil {
+	c.wg.Wait()
+	if c.batchChan != nil {
 		close(c.batchChan)
 		c.batchChan = nil
 		c.logger.Info("Closed PVC metrics collector batch input channel")
 	}

Also applies to: 559-579, 642-668


Comment on lines +681 to +697
func (c *PersistentVolumeClaimMetricsCollector) IsAvailable(ctx context.Context) bool {
	if c.prometheusAPI == nil {
		c.logger.Info("Prometheus API not available for PVC metrics")
		return false
	}

	queryCtx, cancel := context.WithTimeout(ctx, 10*time.Second)
	defer cancel()

	_, _, err := c.prometheusAPI.Query(queryCtx, "up", time.Now())
	if err != nil {
		c.logger.Info("Prometheus API not responding", "error", err.Error())
		return false
	}

	return true
}

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# Find the file and understand its structure
find . -name "pvc_metrics_collector.go" -type f

Repository: devzero-inc/zxporter

Length of output: 184


🏁 Script executed:

# Search for the Start() method in the collector to see where prometheusAPI is initialized
rg -A 20 "func \(c \*PersistentVolumeClaimMetricsCollector\) Start" internal/collector/pvc_metrics_collector.go

Repository: devzero-inc/zxporter

Length of output: 727


🏁 Script executed:

# Search for where IsAvailable is called to understand the lifecycle
rg "IsAvailable" --type go -B 3 -A 3

Repository: devzero-inc/zxporter

Length of output: 29640


🏁 Script executed:

# Look for the NewPersistentVolumeClaimMetricsCollector constructor
rg -A 15 "func NewPersistentVolumeClaimMetricsCollector" internal/collector/pvc_metrics_collector.go

Repository: devzero-inc/zxporter

Length of output: 614


🏁 Script executed:

# Get more context from the controller file about the registration flow
rg -A 30 "if c.collector.IsAvailable" internal/controller/collectionpolicy_controller.go

Repository: devzero-inc/zxporter

Length of output: 1203


🏁 Script executed:

# Look for when Start() is called on collectors
rg "\.Start\(" internal/controller/collectionpolicy_controller.go -B 5 -A 5

Repository: devzero-inc/zxporter

Length of output: 1287


🏁 Script executed:

# Check the complete collector registration and initialization sequence
rg -A 50 "RegisterCollector" internal/collector/interface.go

Repository: devzero-inc/zxporter

Length of output: 46


🏁 Script executed:

# Look for where collectors are created and when Start is called in the reconciliation
rg -B 5 -A 15 "func.*Reconcile" internal/controller/collectionpolicy_controller.go | head -80

Repository: devzero-inc/zxporter

Length of output: 3330


🏁 Script executed:

# Search for RegisterCollector implementation to see if it calls Start()
rg -A 30 "func.*RegisterCollector" internal/

Repository: devzero-inc/zxporter

Length of output: 1965


🏁 Script executed:

# Check if there's any other place prometheusAPI is initialized
rg "prometheusAPI =" internal/collector/pvc_metrics_collector.go -B 2 -A 2

Repository: devzero-inc/zxporter

Length of output: 519


🏁 Script executed:

# Look for any initialization of prometheusAPI before Start()
rg "prometheusAPI" internal/collector/pvc_metrics_collector.go | head -20

Repository: devzero-inc/zxporter

Length of output: 475


🏁 Script executed:

# Get full context of the PVC metrics collector constructor and fields
rg -B 5 -A 40 "type PersistentVolumeClaimMetricsCollector struct" internal/collector/pvc_metrics_collector.go

Repository: devzero-inc/zxporter

Length of output: 1737


🏁 Script executed:

# Search for where Start() is called on registered collectors
rg "\.Start\(" internal/collector/manager.go -B 5 -A 10

Repository: devzero-inc/zxporter

Length of output: 620


🏁 Script executed:

# Look for the full reconciliation flow to see when collectors are started
rg -B 10 -A 20 "registerResourceCollectors" internal/controller/collectionpolicy_controller.go | head -100

Repository: devzero-inc/zxporter

Length of output: 2905


Lazy-initialize Prometheus API client in IsAvailable().

The collector cannot be registered because prometheusAPI is only initialized in Start(), but IsAvailable() is called before Start(). When IsAvailable() returns false due to prometheusAPI being nil, the collector is never registered, so Start() is never called, creating a deadlock where the collector can never run.

Initialize the Prometheus client in IsAvailable() if not already initialized (or move initialization to the constructor) to break this cycle.

🔧 Proposed fix (lazy init)
 func (c *PersistentVolumeClaimMetricsCollector) IsAvailable(ctx context.Context) bool {
 	if c.prometheusAPI == nil {
-		c.logger.Info("Prometheus API not available for PVC metrics")
-		return false
+		httpClient := NewPrometheusClient(c.metrics)
+		client, err := api.NewClient(api.Config{
+			Address: c.config.PrometheusURL,
+			Client:  httpClient,
+		})
+		if err != nil {
+			c.logger.Info("Prometheus API not available for PVC metrics", "error", err.Error())
+			return false
+		}
+		c.prometheusAPI = v1.NewAPI(client)
 	}

coderabbitai (Bot) left a comment

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@internal/collector/pvc_metrics_collector.go`:
- Around line 134-143: In PVCMetricsCollector's Start (or the block where
Prometheus client creation fails) avoid calling c.telemetryLogger.Report
unconditionally; guard the call with a nil check (if c.telemetryLogger != nil {
c.telemetryLogger.Report(...) }) or fallback to an existing non-telemetry logger
(e.g., c.logger) so a nil telemetryLogger won't cause a panic when Prometheus
client creation returns an error; ensure the function still returns the original
error after logging.
- Around line 165-170: The informer factory is started via
c.informerFactory.Start(c.stopCh) but if cache.WaitForCacheSync(...) returns
false you currently return an error without stopping informers; before
returning, close the stop channel (or call the factory stop if available) to
signal goroutine shutdown. Update the error path in the block that calls
cache.WaitForCacheSync(c.stopCh, c.pvcInformer.HasSynced,
c.pvInformer.HasSynced) to close c.stopCh (or invoke the informer factory stop)
and then return the fmt.Errorf("timed out waiting for caches to sync") to
prevent informer goroutine leaks.
- Around line 504-553: The Prometheus queries in the PVC metrics collector
currently pick vector[0] and can return multiple series; change the query
strings in the queries map to use max aggregation by PVC (e.g. `max by
(namespace, persistentvolumeclaim)(kubelet_volume_stats_used_bytes{...})`) for
"used", "capacity" and "available", then in the code that calls
c.prometheusAPI.Query (the blocks that inspect result.Type() == model.ValVector
and convert to model.Vector) enforce that the returned vector has exactly one
sample (len(vector) == 1) and return a clear error if it has 0 or >1 samples;
also treat non-vector result types as errors rather than silently ignoring them
(affects prometheusAPI.Query handling and population of filesystemUsage).
♻️ Duplicate comments (3)
internal/collector/pvc_metrics_collector.go (3)

29-31: Doc comment default URL is out of sync with constructor.

The comment still references http://prometheus.monitoring:9090, but the constructor defaults to http://prometheus-service.monitoring.svc.cluster.local:9090. Keeping these aligned avoids confusion for operators.

💡 Suggested fix
-// If empty, defaults to in-cluster Prometheus at http://prometheus.monitoring:9090
+// If empty, defaults to in-cluster Prometheus at http://prometheus-service.monitoring.svc.cluster.local:9090

642-663: Avoid send-on-closed panic during shutdown.

collectAllPVCMetrics can still emit snapshots while Stop() closes batchChan, which can panic. Coordinate shutdown (e.g., wait for the collection goroutine to exit) before closing the channel.

🔧 Suggested fix
 type PersistentVolumeClaimMetricsCollector struct {
 	// ...
 	mu              sync.RWMutex
+	wg              sync.WaitGroup
 }
-	go c.collectMetricsLoop(ctx)
+	c.wg.Add(1)
+	go func() {
+		defer c.wg.Done()
+		c.collectMetricsLoop(ctx)
+	}()
-	if c.batchChan != nil {
+	c.wg.Wait()
+	if c.batchChan != nil {
 		close(c.batchChan)
 		c.batchChan = nil
 		c.logger.Info("Closed PVC metrics collector batch input channel")
 	}

681-685: Initialize Prometheus API before availability check.

IsAvailable() returns false when prometheusAPI is nil, but prometheusAPI is only initialized in Start(). If IsAvailable() is used to decide registration, the collector can never be registered. Lazy-init in IsAvailable() (or move initialization into the constructor) and guard with sync.Once/mutex to avoid races.

🔧 Suggested fix (lazy init)
 func (c *PersistentVolumeClaimMetricsCollector) IsAvailable(ctx context.Context) bool {
 	if c.prometheusAPI == nil {
-		c.logger.Info("Prometheus API not available for PVC metrics")
-		return false
+		httpClient := NewPrometheusClient(c.metrics)
+		client, err := api.NewClient(api.Config{
+			Address: c.config.PrometheusURL,
+			Client:  httpClient,
+		})
+		if err != nil {
+			c.logger.Info("Prometheus API not available for PVC metrics", "error", err.Error())
+			return false
+		}
+		c.prometheusAPI = v1.NewAPI(client)
 	}
#!/bin/bash
# Verify where IsAvailable() and Start() are called to confirm registration order.
rg -n "IsAvailable\(" internal -C3
rg -n "\.Start\(" internal -C3

Comment on lines +134 to +143
c.telemetryLogger.Report(
	gen.LogLevel_LOG_LEVEL_ERROR,
	"PVCMetricsCollector",
	"Failed to create Prometheus client",
	err,
	map[string]string{
		"prometheus_url":   c.config.PrometheusURL,
		"zxporter_version": version.Get().String(),
	},
)

⚠️ Potential issue | 🟠 Major

Guard telemetryLogger in the error path to avoid a nil panic.

telemetryLogger is optional elsewhere, but here it’s dereferenced unconditionally. If it’s nil and Prometheus client creation fails, Start() will panic instead of returning the error.

🛠️ Suggested fix
-	c.telemetryLogger.Report(
-		gen.LogLevel_LOG_LEVEL_ERROR,
-		"PVCMetricsCollector",
-		"Failed to create Prometheus client",
-		err,
-		map[string]string{
-			"prometheus_url":   c.config.PrometheusURL,
-			"zxporter_version": version.Get().String(),
-		},
-	)
+	if c.telemetryLogger != nil {
+		c.telemetryLogger.Report(
+			gen.LogLevel_LOG_LEVEL_ERROR,
+			"PVCMetricsCollector",
+			"Failed to create Prometheus client",
+			err,
+			map[string]string{
+				"prometheus_url":   c.config.PrometheusURL,
+				"zxporter_version": version.Get().String(),
+			},
+		)
+	}

Comment on lines +165 to +170
c.informerFactory.Start(c.stopCh)

c.logger.Info("Waiting for informer caches to sync")
if !cache.WaitForCacheSync(c.stopCh, c.pvcInformer.HasSynced, c.pvInformer.HasSynced) {
	return fmt.Errorf("timed out waiting for caches to sync")
}

⚠️ Potential issue | 🟡 Minor

Clean up informers when cache sync fails.

If WaitForCacheSync fails, the informer factory has already started and will keep goroutines running. Consider closing stopCh (or calling Stop()) before returning the error to avoid leaks.

🛠️ Suggested fix
	if !cache.WaitForCacheSync(c.stopCh, c.pvcInformer.HasSynced, c.pvInformer.HasSynced) {
+		select {
+		case <-c.stopCh:
+		default:
+			close(c.stopCh)
+		}
 		return fmt.Errorf("timed out waiting for caches to sync")
 	}

Comment on lines +504 to +553
// Prometheus queries for PVC volume stats
queries := map[string]string{
	"used":      fmt.Sprintf(`kubelet_volume_stats_used_bytes{namespace="%s", persistentvolumeclaim="%s"}`, pvc.Namespace, pvc.Name),
	"capacity":  fmt.Sprintf(`kubelet_volume_stats_capacity_bytes{namespace="%s", persistentvolumeclaim="%s"}`, pvc.Namespace, pvc.Name),
	"available": fmt.Sprintf(`kubelet_volume_stats_available_bytes{namespace="%s", persistentvolumeclaim="%s"}`, pvc.Namespace, pvc.Name),
}

usage := &filesystemUsage{}
queryTime := time.Now()

result, _, err := c.prometheusAPI.Query(ctx, queries["used"], queryTime)
if err != nil {
	return nil, fmt.Errorf("failed to query used bytes: %w", err)
}

if result.Type() == model.ValVector {
	vector := result.(model.Vector)
	if len(vector) > 0 {
		usage.UsedBytes = int64(vector[0].Value)
	} else {
		return nil, fmt.Errorf("no used bytes metric found")
	}
}

result, _, err = c.prometheusAPI.Query(ctx, queries["capacity"], queryTime)
if err != nil {
	return nil, fmt.Errorf("failed to query capacity bytes: %w", err)
}

if result.Type() == model.ValVector {
	vector := result.(model.Vector)
	if len(vector) > 0 {
		usage.CapacityBytes = int64(vector[0].Value)
	} else {
		return nil, fmt.Errorf("no capacity bytes metric found")
	}
}

result, _, err = c.prometheusAPI.Query(ctx, queries["available"], queryTime)
if err != nil {
	return nil, fmt.Errorf("failed to query available bytes: %w", err)
}

if result.Type() == model.ValVector {
	vector := result.(model.Vector)
	if len(vector) > 0 {
		usage.AvailableBytes = int64(vector[0].Value)
	} else {
		return nil, fmt.Errorf("no available bytes metric found")
	}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

kubelet_volume_stats metrics multiple series per PVC Prometheus aggregation

💡 Result:

kubelet_volume_stats_* are emitted by each kubelet (i.e., per scrape target / node). So you can legitimately see multiple time series for the same (namespace, persistentvolumeclaim) when the same PVC is visible from more than one kubelet scrape target (e.g., multi-attach volumes, RWX volumes mounted on multiple nodes, or duplicate scraping / relabeling that creates multiple targets for the same node). The “canonical” identifying labels for these metrics are namespace + persistentvolumeclaim (and then you’ll typically also have target labels like instance, node, job, etc.). [1], [2]

PromQL: de-duplicate / collapse to 1 series per PVC

In most setups, each kubelet reports the same bytes values for a given PVC, so you usually want max (not sum) across the per-node/target duplicates:

max by (namespace, persistentvolumeclaim) (kubelet_volume_stats_used_bytes)
max by (namespace, persistentvolumeclaim) (kubelet_volume_stats_capacity_bytes)
max by (namespace, persistentvolumeclaim) (kubelet_volume_stats_available_bytes)

PVC usage percent (1 series per PVC):

100 *
max by (namespace, persistentvolumeclaim) (kubelet_volume_stats_used_bytes)
/
max by (namespace, persistentvolumeclaim) (kubelet_volume_stats_capacity_bytes)

(You’ll often see examples using sum by (...) for per-PVC rollups; if you have duplicates, prefer max to avoid overcounting.) [3]

If your “namespace” label looks wrong (common kube-prometheus-stack gotcha)

If you scrape kubelet with honorLabels: true, Prometheus can rename clashing labels to exported_namespace, etc. In that case you may need to filter/aggregate on exported_namespace (or disable honorLabels for that scrape). [4]

Sources:
[1] Kubernetes docs: Volume stats metrics use namespace and persistentvolumeclaim labels (example: health status metric) https://kubernetes.io/docs/concepts/storage/volume-health-monitoring/
[2] GKE kubelet metrics reference showing kubelet_volume_stats_* label set includes namespace and persistentvolumeclaim https://cloud.google.com/kubernetes-engine/docs/how-to/cadvisor-kubelet-metrics
[3] StackOverflow examples of aggregating kubelet_volume_stats_* by (namespace, persistentvolumeclaim) https://stackoverflow.com/questions/44718268/how-to-monitor-disk-usage-of-kubernetes-persistent-volumes
[4] Discussion of exported_namespace / honorLabels causing misleading PVC labels https://www.reddit.com/r/PrometheusMonitoring/comments/17l21h3/information_about_kubernetes_pvcs_are_wrong/


🏁 Script executed:

# Examine the code file to understand the context better
head -n 560 internal/collector/pvc_metrics_collector.go | tail -n 70

Repository: devzero-inc/zxporter

Length of output: 2378


🏁 Script executed:

# Check if there are tests or examples in the codebase
fd -e go pvc_metrics | head -20

Repository: devzero-inc/zxporter

Length of output: 141


🏁 Script executed:

# Look for any existing aggregation or handling patterns in the codebase
rg "sum by|topk|max|min" -t go | grep -i pvc | head -20

Repository: devzero-inc/zxporter

Length of output: 543


Use max aggregation for PVC metrics; enforce single-sample validation to avoid incorrect results from arbitrary sample selection.

These metrics are emitted per kubelet instance, producing multiple series for the same PVC when mounted on multiple nodes (common for RWX volumes). Taking vector[0] selects arbitrary data, and non-vector results are silently ignored, leaving zero values. Aggregate with max by (namespace, persistentvolumeclaim) (not sum) and enforce exactly one sample per query.

🛠️ Suggested fix
-	queries := map[string]string{
-		"used":      fmt.Sprintf(`kubelet_volume_stats_used_bytes{namespace="%s", persistentvolumeclaim="%s"}`, pvc.Namespace, pvc.Name),
-		"capacity":  fmt.Sprintf(`kubelet_volume_stats_capacity_bytes{namespace="%s", persistentvolumeclaim="%s"}`, pvc.Namespace, pvc.Name),
-		"available": fmt.Sprintf(`kubelet_volume_stats_available_bytes{namespace="%s", persistentvolumeclaim="%s"}`, pvc.Namespace, pvc.Name),
-	}
+	queries := map[string]string{
+		"used":      fmt.Sprintf(`max by (namespace, persistentvolumeclaim) (kubelet_volume_stats_used_bytes{namespace="%s", persistentvolumeclaim="%s"})`, pvc.Namespace, pvc.Name),
+		"capacity":  fmt.Sprintf(`max by (namespace, persistentvolumeclaim) (kubelet_volume_stats_capacity_bytes{namespace="%s", persistentvolumeclaim="%s"})`, pvc.Namespace, pvc.Name),
+		"available": fmt.Sprintf(`max by (namespace, persistentvolumeclaim) (kubelet_volume_stats_available_bytes{namespace="%s", persistentvolumeclaim="%s"})`, pvc.Namespace, pvc.Name),
+	}
-	if result.Type() == model.ValVector {
-		vector := result.(model.Vector)
-		if len(vector) > 0 {
-			usage.UsedBytes = int64(vector[0].Value)
-		} else {
-			return nil, fmt.Errorf("no used bytes metric found")
-		}
-	}
+	vector, ok := result.(model.Vector)
+	if !ok {
+		return nil, fmt.Errorf("unexpected result type for used bytes: %s", result.Type())
+	}
+	if len(vector) != 1 {
+		return nil, fmt.Errorf("expected 1 sample for used bytes, got %d", len(vector))
+	}
+	usage.UsedBytes = int64(vector[0].Value)

coderabbitai (Bot) left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
internal/collector/interface.go (1)

90-111: Do not insert new constants mid-iota; append to preserve existing numeric values.

Inserting PersistentVolumeClaimMetrics between PersistentVolume and StorageClass renumbers all subsequent constants, breaking compatibility if these integers are serialized or stored externally. The codebase explicitly converts ResourceType to protobuf via .ProtoType() and transmits values over gRPC (internal/transport/dakr_client.go), confirming that numeric values matter. The existing comments "leaving here to not screw up enum numbering" on later constants show this concern is known. Move PersistentVolumeClaimMetrics to the end of the enum instead.
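
Illustratively, appending keeps every existing constant's integer stable; this is a generic sketch, not the actual enum from interface.go:

type ResourceType int

const (
	Pod ResourceType = iota // existing entries keep their numeric values
	PersistentVolume
	StorageClass
	// ... remaining existing entries ...

	// New types are appended so serialized integers stay wire-compatible.
	PersistentVolumeClaimMetrics
)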

🤖 Fix all issues with AI agents
In `@internal/collector/pvc_metrics_collector.go`:
- Around line 681-694: The IsAvailable method on
PersistentVolumeClaimMetricsCollector incorrectly always returns true; change it
to return false when the Prometheus API is unavailable or the query fails: in
PersistentVolumeClaimMetricsCollector.IsAvailable, when c.prometheusAPI == nil
log the condition and return false, and after the Query call if err != nil log
the error and return false; otherwise return true. Ensure you reference the
prometheusAPI field and the IsAvailable method so callers in
collectionpolicy_controller.go will correctly skip unavailable collectors.

Comment on lines +681 to +694
func (c *PersistentVolumeClaimMetricsCollector) IsAvailable(ctx context.Context) bool {
	if c.prometheusAPI == nil {
		c.logger.Info("Prometheus API not available for PVC metrics")
		return true
	}

	queryCtx, cancel := context.WithTimeout(ctx, 10*time.Second)
	defer cancel()

	_, _, err := c.prometheusAPI.Query(queryCtx, "up", time.Now())
	if err != nil {
		c.logger.Info("Prometheus API not responding", "error", err.Error())
		return true
	}

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

cd /root && find . -type f -name "pvc_metrics_collector.go" | head -5

Repository: devzero-inc/zxporter

Length of output: 119


🏁 Script executed:

find . -type f -name "pvc_metrics_collector.go" -exec wc -l {} \;

Repository: devzero-inc/zxporter

Length of output: 113


🏁 Script executed:

find . -type f -name "pvc_metrics_collector.go" -exec cat -n {} \; | sed -n '675,700p'

Repository: devzero-inc/zxporter

Length of output: 946


🏁 Script executed:

rg "IsAvailable" --type go -B 2 -A 2 | grep -A 3 -B 3 "pvc_metrics_collector\|PersistentVolumeClaimMetricsCollector"

Repository: devzero-inc/zxporter

Length of output: 715


🏁 Script executed:

rg "\.IsAvailable\(" --type go

Repository: devzero-inc/zxporter

Length of output: 261


🏁 Script executed:

cat -n internal/controller/collectionpolicy_controller.go | grep -B 5 -A 5 "IsAvailable"

Repository: devzero-inc/zxporter

Length of output: 960


IsAvailable currently always returns true, preventing availability gating from working.

The method logs error conditions at lines 683 and 692 but still returns true, which contradicts the intent. Callers at line 1494 and 2659 of collectionpolicy_controller.go rely on IsAvailable returning false to skip unavailable collectors. By always returning true, unavailable collectors (when Prometheus API is nil or unresponsive) still get registered, defeating the availability check.

🛠️ Suggested fix
 	if c.prometheusAPI == nil {
 		c.logger.Info("Prometheus API not available for PVC metrics")
-		return true
+		return false
 	}
@@
 	if err != nil {
 		c.logger.Info("Prometheus API not responding", "error", err.Error())
-		return true
+		return false
 	}

coderabbitai (Bot) left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
internal/controller/collectionpolicy_controller.go (2)

693-695: Missing pvc_metrics in affected collectors when ExcludedPVCs changes.

The pvc_metrics collector constructor takes ExcludedPVCs as a parameter (line 2334), but when ExcludedPVCs changes, only "persistent_volume_claim" is marked as affected. The pvc_metrics collector will not be restarted and will continue using stale exclusion rules.

🐛 Proposed fix
 if !reflect.DeepEqual(oldConfig.ExcludedPVCs, newConfig.ExcludedPVCs) {
     affectedCollectors["persistent_volume_claim"] = true
+    affectedCollectors["pvc_metrics"] = true
 }

831-837: Missing pvc_metrics when Prometheus configuration changes.

The pvc_metrics collector queries Prometheus (via PrometheusURL) and uses UpdateInterval, but it's not included in the affected collectors when these values change. This means configuration changes won't take effect until a full restart.

🐛 Proposed fix
 if oldConfig.UpdateInterval != newConfig.UpdateInterval ||
     oldConfig.PrometheusURL != newConfig.PrometheusURL ||
     oldConfig.DisableNetworkIOMetrics != newConfig.DisableNetworkIOMetrics ||
     oldConfig.DisableGPUMetrics != newConfig.DisableGPUMetrics {
     affectedCollectors["node"] = true
     affectedCollectors["container_resource"] = true
+    affectedCollectors["pvc_metrics"] = true
 }
🤖 Fix all issues with AI agents
In `@helm-chart/zxporter/local_values.yaml`:
- Around line 31-36: Remove the hard-coded personal access token by clearing or
replacing the patToken value in the Helm values so no real secret is committed;
locate the patToken entry and set it to an empty string or a non-secret
placeholder (e.g., "") and add a comment directing operators to supply the real
PAT via a secret or external secret management at deploy time (reference:
patToken).

Comment thread on helm-chart/zxporter/local_values.yaml (outdated)
f-leu added the changelog:added (New features) label on Jan 28, 2026
Rupam-It merged commit 0cdb3c1 into main on Jan 28, 2026
26 checks passed
Rupam-It deleted the rupam/zxporter—PVCStorageUsageMetricsCollector branch on January 28, 2026 at 10:45
Parthiba-Hazra pushed a commit that referenced this pull request May 5, 2026
* pvc metrics collector added

* endpoint fixed

* make genrate-proto fix every thing

* fix lint error

* local deployment values deleted

---------

Labels

changelog:added New features


3 participants