PVC metrics collector added #252
📝 Walkthrough

Adds a new PersistentVolumeClaimMetrics resource type, a strongly-typed PVC metrics snapshot, a Prometheus-backed PVC metrics collector with informer-driven listing and batching, controller wiring to register/start/stop the collector, and a local Helm values file.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Controller
    participant Collector as PVC Metrics Collector
    participant K8s as Kubernetes Informers
    participant Prometheus
    participant Batcher

    Controller->>Collector: Start(ctx)
    Collector->>K8s: Init informers & WaitForCacheSync
    K8s-->>Collector: Cache ready (PVCs)
    Collector->>Batcher: Start batcher
    Collector->>Collector: Launch collectMetricsLoop
    loop Periodic collection
        Collector->>K8s: List PVCs (namespace-scoped or all)
        K8s-->>Collector: PVC list
        Collector->>Collector: Filter exclusions / skip unbound
        loop Per-PVC
            alt filesystem-backed
                Collector->>Prometheus: Query used/capacity/available
                Prometheus-->>Collector: metrics or errors
            else block-backed
                Collector-->>Collector: mark stats unavailable
            end
            Collector->>Batcher: emitSnapshot(snapshot)
        end
        Batcher-->>Controller: Send batched snapshots -> output channel
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 1 passed | ❌ 4 failed (2 warnings, 2 inconclusive)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
internal/collector/interface.go (1)
207-309: Add ProtoType mapping for PersistentVolumeClaimMetrics.

Without a case in ProtoType(), this new resource type will serialize as RESOURCE_TYPE_UNSPECIFIED, so pvc-metrics resources will be mislabeled on the wire.

🔧 Proposed fix

```diff
 case PersistentVolume:
 	return gen.ResourceType_RESOURCE_TYPE_PERSISTENT_VOLUME
+case PersistentVolumeClaimMetrics:
+	return gen.ResourceType_RESOURCE_TYPE_PVC_METRICS // use the actual proto enum name
 case StorageClass:
 	return gen.ResourceType_RESOURCE_TYPE_STORAGE_CLASS
```

internal/controller/collectionpolicy_controller.go (1)
658-661: Config changes won’t restart pvc_metrics.

identifyAffectedCollectors updates PVC resource collectors, but doesn’t include pvc_metrics for PVC exclusion or Prometheus/update-interval changes, so metrics can keep stale config until a full restart.

🔧 Proposed fix

```diff
 if !reflect.DeepEqual(oldConfig.ExcludedPVCs, newConfig.ExcludedPVCs) {
 	affectedCollectors["persistent_volume_claim"] = true
+	affectedCollectors["pvc_metrics"] = true
 }
 if oldConfig.UpdateInterval != newConfig.UpdateInterval ||
 	oldConfig.PrometheusURL != newConfig.PrometheusURL ||
 	oldConfig.DisableNetworkIOMetrics != newConfig.DisableNetworkIOMetrics ||
 	oldConfig.DisableGPUMetrics != newConfig.DisableGPUMetrics {
 	affectedCollectors["node"] = true
 	affectedCollectors["container_resource"] = true
+	affectedCollectors["pvc_metrics"] = true
 }
```

Also applies to: 787-794
🤖 Fix all issues with AI agents
In `@internal/collector/pvc_metrics_collector.go`:
- Around line 176-188: Shutdown can race: collectAllPVCMetrics (from
collectMetricsLoop) may send to batchChan after Stop() closes it causing a
send-on-closed panic; change shutdown to coordinate goroutines so batchChan is
closed only after the collector loop exits. Modify
collectMetricsLoop/collectAllPVCMetrics coordination: add a sync.WaitGroup or a
dedicated done channel tracked by the collector (refer to collectMetricsLoop,
collectAllPVCMetrics, Stop, batchChan, stopCh, ticker), have Stop signal the
loop to exit (via stopCh or context) and then wait for the goroutine to finish
before closing batchChan and stopping the ticker; alternatively close batchChan
from inside the collector goroutine after the loop ends instead of from Stop.
- Around line 24-35: Update the doc comment for
PersistentVolumeClaimMetricsCollectorConfig.PrometheusURL to match the actual
default used by the constructor (the in-cluster URL
"http://prometheus-service.monitoring.svc.cluster.local:8080"); locate the
struct PersistentVolumeClaimMetricsCollectorConfig and replace the current
comment that says "http://prometheus.monitoring:9090" with the correct default
URL so the documentation matches the code path that sets the default in the
collector constructor that initializes PrometheusURL.
- Around line 681-697: IsAvailable on PersistentVolumeClaimMetricsCollector
currently returns false when prometheusAPI is nil causing a registration
deadlock; fix by lazily initializing prometheusAPI inside IsAvailable (or move
initialization into the constructor) before issuing the test Query: call the
same client creation routine used by Start (or a new helper like
initPrometheusClient), handle and log initialization errors, and ensure
concurrent safety by using a mutex or sync.Once around prometheusAPI assignment
so multiple callers don't race; after successful init proceed with the Query and
return true/false based on the Query result.
```go
// PersistentVolumeClaimMetricsCollectorConfig holds configuration for the PVC metrics collector
type PersistentVolumeClaimMetricsCollectorConfig struct {
	// UpdateInterval specifies how often to collect PVC metrics
	UpdateInterval time.Duration

	// PrometheusURL specifies the URL of the Prometheus instance to query
	// If empty, defaults to in-cluster Prometheus at http://prometheus.monitoring:9090
	PrometheusURL string

	// QueryTimeout specifies the timeout for Prometheus queries
	QueryTimeout time.Duration
}
```
🧩 Analysis chain
🏁 Scripts executed:

```shell
# First, let's find and examine the pvc_metrics_collector.go file
find . -name "pvc_metrics_collector.go" -type f

# Read the file to verify the doc comment and constructor
cat -n ./internal/collector/pvc_metrics_collector.go
```
Update doc comment to match the actual default Prometheus URL.
The config comment at line 30 states the default is http://prometheus.monitoring:9090, but the constructor at line 84 sets it to http://prometheus-service.monitoring.svc.cluster.local:8080. Update the comment to reflect the actual implementation.
Suggested fix

```diff
-// If empty, defaults to in-cluster Prometheus at http://prometheus.monitoring:9090
+// If empty, defaults to in-cluster Prometheus at http://prometheus-service.monitoring.svc.cluster.local:8080
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```go
// PersistentVolumeClaimMetricsCollectorConfig holds configuration for the PVC metrics collector
type PersistentVolumeClaimMetricsCollectorConfig struct {
	// UpdateInterval specifies how often to collect PVC metrics
	UpdateInterval time.Duration

	// PrometheusURL specifies the URL of the Prometheus instance to query
	// If empty, defaults to in-cluster Prometheus at http://prometheus-service.monitoring.svc.cluster.local:8080
	PrometheusURL string

	// QueryTimeout specifies the timeout for Prometheus queries
	QueryTimeout time.Duration
}
```
```go
c.ticker = time.NewTicker(c.config.UpdateInterval)

go c.collectMetricsLoop(ctx)

stopCh := c.stopCh
go func() {
	select {
	case <-ctx.Done():
		c.Stop()
	case <-stopCh:
	}
}()
```
Prevent send-on-closed panic during shutdown.
collectAllPVCMetrics can still emit snapshots while Stop() closes batchChan, which can panic. Coordinate shutdown (e.g., wait for the collection goroutine to finish) before closing the channel.
🔧 Suggested fix

```diff
 type PersistentVolumeClaimMetricsCollector struct {
 	// ...
 	mu sync.RWMutex
+	wg sync.WaitGroup
 }

-	go c.collectMetricsLoop(ctx)
+	c.wg.Add(1)
+	go func() {
+		defer c.wg.Done()
+		c.collectMetricsLoop(ctx)
+	}()

 	// ...
 	select {
 	case <-c.stopCh:
 		c.logger.Info("PVC metrics collector stop channel already closed")
 	default:
 		close(c.stopCh)
 		c.logger.Info("Closed PVC metrics collector stop channel")
 	}
-	if c.batchChan != nil {
+	c.wg.Wait()
+	if c.batchChan != nil {
 		close(c.batchChan)
 		c.batchChan = nil
 		c.logger.Info("Closed PVC metrics collector batch input channel")
 	}
```

Also applies to: 559-579, 642-668
```go
func (c *PersistentVolumeClaimMetricsCollector) IsAvailable(ctx context.Context) bool {
	if c.prometheusAPI == nil {
		c.logger.Info("Prometheus API not available for PVC metrics")
		return false
	}

	queryCtx, cancel := context.WithTimeout(ctx, 10*time.Second)
	defer cancel()

	_, _, err := c.prometheusAPI.Query(queryCtx, "up", time.Now())
	if err != nil {
		c.logger.Info("Prometheus API not responding", "error", err.Error())
		return false
	}

	return true
}
```
🧩 Analysis chain
🏁 Scripts executed:

```shell
# Find the file and understand its structure
find . -name "pvc_metrics_collector.go" -type f

# Search for the Start() method in the collector to see where prometheusAPI is initialized
rg -A 20 "func \(c \*PersistentVolumeClaimMetricsCollector\) Start" internal/collector/pvc_metrics_collector.go

# Search for where IsAvailable is called to understand the lifecycle
rg "IsAvailable" --type go -B 3 -A 3

# Look for the NewPersistentVolumeClaimMetricsCollector constructor
rg -A 15 "func NewPersistentVolumeClaimMetricsCollector" internal/collector/pvc_metrics_collector.go

# Get more context from the controller file about the registration flow
rg -A 30 "if c.collector.IsAvailable" internal/controller/collectionpolicy_controller.go

# Look for when Start() is called on collectors
rg "\.Start\(" internal/controller/collectionpolicy_controller.go -B 5 -A 5

# Check the complete collector registration and initialization sequence
rg -A 50 "RegisterCollector" internal/collector/interface.go

# Look for where collectors are created and when Start is called in the reconciliation
rg -B 5 -A 15 "func.*Reconcile" internal/controller/collectionpolicy_controller.go | head -80

# Search for RegisterCollector implementation to see if it calls Start()
rg -A 30 "func.*RegisterCollector" internal/

# Check if there's any other place prometheusAPI is initialized
rg "prometheusAPI =" internal/collector/pvc_metrics_collector.go -B 2 -A 2

# Look for any initialization of prometheusAPI before Start()
rg "prometheusAPI" internal/collector/pvc_metrics_collector.go | head -20

# Get full context of the PVC metrics collector constructor and fields
rg -B 5 -A 40 "type PersistentVolumeClaimMetricsCollector struct" internal/collector/pvc_metrics_collector.go

# Search for where Start() is called on registered collectors
rg "\.Start\(" internal/collector/manager.go -B 5 -A 10

# Look for the full reconciliation flow to see when collectors are started
rg -B 10 -A 20 "registerResourceCollectors" internal/controller/collectionpolicy_controller.go | head -100
```
Lazy-initialize Prometheus API client in IsAvailable().
The collector cannot be registered because prometheusAPI is only initialized in Start(), but IsAvailable() is called before Start(). When IsAvailable() returns false due to prometheusAPI being nil, the collector is never registered, so Start() is never called, creating a deadlock where the collector can never run.
Initialize the Prometheus client in IsAvailable() if not already initialized (or move initialization to the constructor) to break this cycle.
🔧 Proposed fix (lazy init)

```diff
 func (c *PersistentVolumeClaimMetricsCollector) IsAvailable(ctx context.Context) bool {
 	if c.prometheusAPI == nil {
-		c.logger.Info("Prometheus API not available for PVC metrics")
-		return false
+		httpClient := NewPrometheusClient(c.metrics)
+		client, err := api.NewClient(api.Config{
+			Address: c.config.PrometheusURL,
+			Client:  httpClient,
+		})
+		if err != nil {
+			c.logger.Info("Prometheus API not available for PVC metrics", "error", err.Error())
+			return false
+		}
+		c.prometheusAPI = v1.NewAPI(client)
 	}
```
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@internal/collector/pvc_metrics_collector.go`:
- Around line 134-143: In PVCMetricsCollector's Start (or the block where
Prometheus client creation fails) avoid calling c.telemetryLogger.Report
unconditionally; guard the call with a nil check (if c.telemetryLogger != nil {
c.telemetryLogger.Report(...) }) or fallback to an existing non-telemetry logger
(e.g., c.logger) so a nil telemetryLogger won't cause a panic when Prometheus
client creation returns an error; ensure the function still returns the original
error after logging.
- Around line 165-170: The informer factory is started via
c.informerFactory.Start(c.stopCh) but if cache.WaitForCacheSync(...) returns
false you currently return an error without stopping informers; before
returning, close the stop channel (or call the factory stop if available) to
signal goroutine shutdown. Update the error path in the block that calls
cache.WaitForCacheSync(c.stopCh, c.pvcInformer.HasSynced,
c.pvInformer.HasSynced) to close c.stopCh (or invoke the informer factory stop)
and then return the fmt.Errorf("timed out waiting for caches to sync") to
prevent informer goroutine leaks.
- Around line 504-553: The Prometheus queries in the PVC metrics collector
currently pick vector[0] and can return multiple series; change the query
strings in the queries map to use max aggregation by PVC (e.g. `max by
(namespace, persistentvolumeclaim)(kubelet_volume_stats_used_bytes{...})`) for
"used", "capacity" and "available", then in the code that calls
c.prometheusAPI.Query (the blocks that inspect result.Type() == model.ValVector
and convert to model.Vector) enforce that the returned vector has exactly one
sample (len(vector) == 1) and return a clear error if it has 0 or >1 samples;
also treat non-vector result types as errors rather than silently ignoring them
(affects prometheusAPI.Query handling and population of filesystemUsage).
♻️ Duplicate comments (3)
internal/collector/pvc_metrics_collector.go (3)
29-31: Doc comment default URL is out of sync with constructor.

The comment still references http://prometheus.monitoring:9090, but the constructor defaults to http://prometheus-service.monitoring.svc.cluster.local:9090. Keeping these aligned avoids confusion for operators.

💡 Suggested fix

```diff
-// If empty, defaults to in-cluster Prometheus at http://prometheus.monitoring:9090
+// If empty, defaults to in-cluster Prometheus at http://prometheus-service.monitoring.svc.cluster.local:9090
```
642-663: Avoid send-on-closed panic during shutdown.

collectAllPVCMetrics can still emit snapshots while Stop() closes batchChan, which can panic. Coordinate shutdown (e.g., wait for the collection goroutine to exit) before closing the channel.

🔧 Suggested fix

```diff
 type PersistentVolumeClaimMetricsCollector struct {
 	// ...
 	mu sync.RWMutex
+	wg sync.WaitGroup
 }

-	go c.collectMetricsLoop(ctx)
+	c.wg.Add(1)
+	go func() {
+		defer c.wg.Done()
+		c.collectMetricsLoop(ctx)
+	}()

-	if c.batchChan != nil {
+	c.wg.Wait()
+	if c.batchChan != nil {
 		close(c.batchChan)
 		c.batchChan = nil
 		c.logger.Info("Closed PVC metrics collector batch input channel")
 	}
```
681-685: Initialize Prometheus API before availability check.

IsAvailable() returns false when prometheusAPI is nil, but prometheusAPI is only initialized in Start(). If IsAvailable() is used to decide registration, the collector can never be registered. Lazy-init in IsAvailable() (or move initialization into the constructor) and guard with sync.Once/mutex to avoid races.

🔧 Suggested fix (lazy init)

```diff
 func (c *PersistentVolumeClaimMetricsCollector) IsAvailable(ctx context.Context) bool {
 	if c.prometheusAPI == nil {
-		c.logger.Info("Prometheus API not available for PVC metrics")
-		return false
+		httpClient := NewPrometheusClient(c.metrics)
+		client, err := api.NewClient(api.Config{
+			Address: c.config.PrometheusURL,
+			Client:  httpClient,
+		})
+		if err != nil {
+			c.logger.Info("Prometheus API not available for PVC metrics", "error", err.Error())
+			return false
+		}
+		c.prometheusAPI = v1.NewAPI(client)
 	}
```

```shell
#!/bin/bash
# Verify where IsAvailable() and Start() are called to confirm registration order.
rg -n "IsAvailable\(" internal -C3
rg -n "\.Start\(" internal -C3
```
```go
c.telemetryLogger.Report(
	gen.LogLevel_LOG_LEVEL_ERROR,
	"PVCMetricsCollector",
	"Failed to create Prometheus client",
	err,
	map[string]string{
		"prometheus_url":   c.config.PrometheusURL,
		"zxporter_version": version.Get().String(),
	},
)
```
Guard telemetryLogger in the error path to avoid a nil panic.
telemetryLogger is optional elsewhere, but here it’s dereferenced unconditionally. If it’s nil and Prometheus client creation fails, Start() will panic instead of returning the error.
🛠️ Suggested fix

```diff
-	c.telemetryLogger.Report(
-		gen.LogLevel_LOG_LEVEL_ERROR,
-		"PVCMetricsCollector",
-		"Failed to create Prometheus client",
-		err,
-		map[string]string{
-			"prometheus_url":   c.config.PrometheusURL,
-			"zxporter_version": version.Get().String(),
-		},
-	)
+	if c.telemetryLogger != nil {
+		c.telemetryLogger.Report(
+			gen.LogLevel_LOG_LEVEL_ERROR,
+			"PVCMetricsCollector",
+			"Failed to create Prometheus client",
+			err,
+			map[string]string{
+				"prometheus_url":   c.config.PrometheusURL,
+				"zxporter_version": version.Get().String(),
+			},
+		)
+	}
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```go
if c.telemetryLogger != nil {
	c.telemetryLogger.Report(
		gen.LogLevel_LOG_LEVEL_ERROR,
		"PVCMetricsCollector",
		"Failed to create Prometheus client",
		err,
		map[string]string{
			"prometheus_url":   c.config.PrometheusURL,
			"zxporter_version": version.Get().String(),
		},
	)
}
```
```go
c.informerFactory.Start(c.stopCh)

c.logger.Info("Waiting for informer caches to sync")
if !cache.WaitForCacheSync(c.stopCh, c.pvcInformer.HasSynced, c.pvInformer.HasSynced) {
	return fmt.Errorf("timed out waiting for caches to sync")
}
```
Clean up informers when cache sync fails.
If WaitForCacheSync fails, the informer factory has already started and will keep goroutines running. Consider closing stopCh (or calling Stop()) before returning the error to avoid leaks.
🛠️ Suggested fix

```diff
 if !cache.WaitForCacheSync(c.stopCh, c.pvcInformer.HasSynced, c.pvInformer.HasSynced) {
+	select {
+	case <-c.stopCh:
+	default:
+		close(c.stopCh)
+	}
 	return fmt.Errorf("timed out waiting for caches to sync")
 }
```

📝 Committable suggestion
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```go
c.informerFactory.Start(c.stopCh)

c.logger.Info("Waiting for informer caches to sync")
if !cache.WaitForCacheSync(c.stopCh, c.pvcInformer.HasSynced, c.pvInformer.HasSynced) {
	select {
	case <-c.stopCh:
	default:
		close(c.stopCh)
	}
	return fmt.Errorf("timed out waiting for caches to sync")
}
```
```go
// Prometheus queries for PVC volume stats
queries := map[string]string{
	"used":      fmt.Sprintf(`kubelet_volume_stats_used_bytes{namespace="%s", persistentvolumeclaim="%s"}`, pvc.Namespace, pvc.Name),
	"capacity":  fmt.Sprintf(`kubelet_volume_stats_capacity_bytes{namespace="%s", persistentvolumeclaim="%s"}`, pvc.Namespace, pvc.Name),
	"available": fmt.Sprintf(`kubelet_volume_stats_available_bytes{namespace="%s", persistentvolumeclaim="%s"}`, pvc.Namespace, pvc.Name),
}

usage := &filesystemUsage{}
queryTime := time.Now()

result, _, err := c.prometheusAPI.Query(ctx, queries["used"], queryTime)
if err != nil {
	return nil, fmt.Errorf("failed to query used bytes: %w", err)
}

if result.Type() == model.ValVector {
	vector := result.(model.Vector)
	if len(vector) > 0 {
		usage.UsedBytes = int64(vector[0].Value)
	} else {
		return nil, fmt.Errorf("no used bytes metric found")
	}
}

result, _, err = c.prometheusAPI.Query(ctx, queries["capacity"], queryTime)
if err != nil {
	return nil, fmt.Errorf("failed to query capacity bytes: %w", err)
}

if result.Type() == model.ValVector {
	vector := result.(model.Vector)
	if len(vector) > 0 {
		usage.CapacityBytes = int64(vector[0].Value)
	} else {
		return nil, fmt.Errorf("no capacity bytes metric found")
	}
}

result, _, err = c.prometheusAPI.Query(ctx, queries["available"], queryTime)
if err != nil {
	return nil, fmt.Errorf("failed to query available bytes: %w", err)
}

if result.Type() == model.ValVector {
	vector := result.(model.Vector)
	if len(vector) > 0 {
		usage.AvailableBytes = int64(vector[0].Value)
	} else {
		return nil, fmt.Errorf("no available bytes metric found")
	}
```
🧩 Analysis chain
🌐 Web query:
kubelet_volume_stats metrics multiple series per PVC Prometheus aggregation
💡 Result:
kubelet_volume_stats_* are emitted by each kubelet (i.e., per scrape target / node). So you can legitimately see multiple time series for the same (namespace, persistentvolumeclaim) when the same PVC is visible from more than one kubelet scrape target (e.g., multi-attach volumes, RWX volumes mounted on multiple nodes, or duplicate scraping / relabeling that creates multiple targets for the same node). The “canonical” identifying labels for these metrics are namespace + persistentvolumeclaim (and then you’ll typically also have target labels like instance, node, job, etc.). [1], [2]
PromQL: de-duplicate / collapse to 1 series per PVC
In most setups, each kubelet reports the same bytes values for a given PVC, so you usually want max (not sum) across the per-node/target duplicates:
```promql
max by (namespace, persistentvolumeclaim) (kubelet_volume_stats_used_bytes)
max by (namespace, persistentvolumeclaim) (kubelet_volume_stats_capacity_bytes)
max by (namespace, persistentvolumeclaim) (kubelet_volume_stats_available_bytes)
```

PVC usage percent (1 series per PVC):

```promql
100 *
  max by (namespace, persistentvolumeclaim) (kubelet_volume_stats_used_bytes)
/
  max by (namespace, persistentvolumeclaim) (kubelet_volume_stats_capacity_bytes)
```
(You’ll often see examples using sum by (...) for per-PVC rollups; if you have duplicates, prefer max to avoid overcounting.) [3]
If your “namespace” label looks wrong (common kube-prometheus-stack gotcha)
If you scrape kubelet with honorLabels: true, Prometheus can rename clashing labels to exported_namespace, etc. In that case you may need to filter/aggregate on exported_namespace (or disable honorLabels for that scrape). [4]
Sources:
[1] Kubernetes docs: Volume stats metrics use namespace and persistentvolumeclaim labels (example: health status metric) https://kubernetes.io/docs/concepts/storage/volume-health-monitoring/
[2] GKE kubelet metrics reference showing kubelet_volume_stats_* label set includes namespace and persistentvolumeclaim https://cloud.google.com/kubernetes-engine/docs/how-to/cadvisor-kubelet-metrics
[3] StackOverflow examples of aggregating kubelet_volume_stats_* by (namespace, persistentvolumeclaim) https://stackoverflow.com/questions/44718268/how-to-monitor-disk-usage-of-kubernetes-persistent-volumes
[4] Discussion of exported_namespace / honorLabels causing misleading PVC labels https://www.reddit.com/r/PrometheusMonitoring/comments/17l21h3/information_about_kubernetes_pvcs_are_wrong/
🏁 Scripts executed:

```shell
# Examine the code file to understand the context better
head -n 560 internal/collector/pvc_metrics_collector.go | tail -n 70

# Check if there are tests or examples in the codebase
fd -e go pvc_metrics | head -20

# Look for any existing aggregation or handling patterns in the codebase
rg "sum by|topk|max|min" -t go | grep -i pvc | head -20
```
Use max aggregation for PVC metrics; enforce single-sample validation to avoid incorrect results from arbitrary sample selection.
These metrics are emitted per kubelet instance, producing multiple series for the same PVC when mounted on multiple nodes (common for RWX volumes). Taking vector[0] selects arbitrary data, and non-vector results are silently ignored, leaving zero values. Aggregate with max by (namespace, persistentvolumeclaim) (not sum) and enforce exactly one sample per query.
🛠️ Suggested fix

```diff
-	queries := map[string]string{
-		"used":      fmt.Sprintf(`kubelet_volume_stats_used_bytes{namespace="%s", persistentvolumeclaim="%s"}`, pvc.Namespace, pvc.Name),
-		"capacity":  fmt.Sprintf(`kubelet_volume_stats_capacity_bytes{namespace="%s", persistentvolumeclaim="%s"}`, pvc.Namespace, pvc.Name),
-		"available": fmt.Sprintf(`kubelet_volume_stats_available_bytes{namespace="%s", persistentvolumeclaim="%s"}`, pvc.Namespace, pvc.Name),
-	}
+	queries := map[string]string{
+		"used":      fmt.Sprintf(`max by (namespace, persistentvolumeclaim) (kubelet_volume_stats_used_bytes{namespace="%s", persistentvolumeclaim="%s"})`, pvc.Namespace, pvc.Name),
+		"capacity":  fmt.Sprintf(`max by (namespace, persistentvolumeclaim) (kubelet_volume_stats_capacity_bytes{namespace="%s", persistentvolumeclaim="%s"})`, pvc.Namespace, pvc.Name),
+		"available": fmt.Sprintf(`max by (namespace, persistentvolumeclaim) (kubelet_volume_stats_available_bytes{namespace="%s", persistentvolumeclaim="%s"})`, pvc.Namespace, pvc.Name),
+	}

-	if result.Type() == model.ValVector {
-		vector := result.(model.Vector)
-		if len(vector) > 0 {
-			usage.UsedBytes = int64(vector[0].Value)
-		} else {
-			return nil, fmt.Errorf("no used bytes metric found")
-		}
-	}
+	vector, ok := result.(model.Vector)
+	if !ok {
+		return nil, fmt.Errorf("unexpected result type for used bytes: %s", result.Type())
+	}
+	if len(vector) != 1 {
+		return nil, fmt.Errorf("expected 1 sample for used bytes, got %d", len(vector))
+	}
+	usage.UsedBytes = int64(vector[0].Value)
```
+ usage.UsedBytes = int64(vector[0].Value)🤖 Prompt for AI Agents
In `@internal/collector/pvc_metrics_collector.go` around lines 504 - 553, The
Prometheus queries in the PVC metrics collector currently pick vector[0] and can
return multiple series; change the query strings in the queries map to use max
aggregation by PVC (e.g. `max by (namespace,
persistentvolumeclaim)(kubelet_volume_stats_used_bytes{...})`) for "used",
"capacity" and "available", then in the code that calls c.prometheusAPI.Query
(the blocks that inspect result.Type() == model.ValVector and convert to
model.Vector) enforce that the returned vector has exactly one sample
(len(vector) == 1) and return a clear error if it has 0 or >1 samples; also
treat non-vector result types as errors rather than silently ignoring them
(affects prometheusAPI.Query handling and population of filesystemUsage).
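The aggregate-then-validate pattern the review asks for can be sketched without the Prometheus client types; `buildPVCQuery` and `singleSample` below are hypothetical helpers standing in for the collector's query construction and vector handling, not zxporter's actual API:

```go
package main

import "fmt"

// buildPVCQuery wraps a kubelet volume-stats metric in a max-by aggregation so
// that a PVC mounted on several nodes (common for RWX volumes) collapses to a
// single series instead of one series per kubelet.
func buildPVCQuery(metric, namespace, pvc string) string {
	return fmt.Sprintf(
		`max by (namespace, persistentvolumeclaim) (%s{namespace=%q, persistentvolumeclaim=%q})`,
		metric, namespace, pvc,
	)
}

// singleSample enforces the "exactly one sample" contract instead of silently
// taking the first element of an arbitrary-length result vector.
func singleSample(samples []float64) (float64, error) {
	if len(samples) != 1 {
		return 0, fmt.Errorf("expected 1 sample, got %d", len(samples))
	}
	return samples[0], nil
}

func main() {
	fmt.Println(buildPVCQuery("kubelet_volume_stats_used_bytes", "default", "data-pvc"))
	if _, err := singleSample([]float64{1, 2}); err != nil {
		fmt.Println("rejected:", err)
	}
}
```

With the aggregation in the query, a two-sample result is a genuine anomaly rather than the normal RWX case, so rejecting it outright is safe.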
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
internal/collector/interface.go (1)

90-111: Do not insert new constants mid-iota; append to preserve existing numeric values.

Inserting `PersistentVolumeClaimMetrics` between `PersistentVolume` and `StorageClass` renumbers all subsequent constants, breaking compatibility if these integers are serialized or stored externally. The codebase explicitly converts `ResourceType` to protobuf via `.ProtoType()` and transmits values over gRPC (internal/transport/dakr_client.go), confirming that numeric values matter. The existing comments "leaving here to not screw up enum numbering" on later constants show this concern is known. Move `PersistentVolumeClaimMetrics` to the end of the enum instead.
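A minimal sketch of the append-only rule for iota enums (the constant names mirror the ones discussed, but the numbering shown is illustrative, not zxporter's actual values):

```go
package main

import "fmt"

type ResourceType int

// Once these integers cross a serialization boundary (protobuf, gRPC, a
// database), their values become part of the contract: new members must be
// appended so existing constants keep their historical numbers.
const (
	PersistentVolume ResourceType = iota // 0 - existing value, must stay 0
	StorageClass                         // 1 - existing value, must stay 1

	// New entries go at the end; inserting one above would silently
	// renumber everything below it.
	PersistentVolumeClaimMetrics // 2
)

func main() {
	fmt.Println(PersistentVolume, StorageClass, PersistentVolumeClaimMetrics)
}
```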
🤖 Fix all issues with AI agents
In `@internal/collector/pvc_metrics_collector.go`:
- Around line 681-694: The IsAvailable method on
PersistentVolumeClaimMetricsCollector incorrectly always returns true; change it
to return false when the Prometheus API is unavailable or the query fails: in
PersistentVolumeClaimMetricsCollector.IsAvailable, when c.prometheusAPI == nil
log the condition and return false, and after the Query call if err != nil log
the error and return false; otherwise return true. Ensure you reference the
prometheusAPI field and the IsAvailable method so callers in
collectionpolicy_controller.go will correctly skip unavailable collectors.
```go
func (c *PersistentVolumeClaimMetricsCollector) IsAvailable(ctx context.Context) bool {
	if c.prometheusAPI == nil {
		c.logger.Info("Prometheus API not available for PVC metrics")
		return true
	}

	queryCtx, cancel := context.WithTimeout(ctx, 10*time.Second)
	defer cancel()

	_, _, err := c.prometheusAPI.Query(queryCtx, "up", time.Now())
	if err != nil {
		c.logger.Info("Prometheus API not responding", "error", err.Error())
		return true
	}
```
🧩 Analysis chain

🏁 Script executed:

```shell
cd /root && find . -type f -name "pvc_metrics_collector.go" | head -5
```

Length of output: 119

🏁 Script executed:

```shell
find . -type f -name "pvc_metrics_collector.go" -exec wc -l {} \;
```

Length of output: 113

🏁 Script executed:

```shell
find . -type f -name "pvc_metrics_collector.go" -exec cat -n {} \; | sed -n '675,700p'
```

Length of output: 946

🏁 Script executed:

```shell
rg "IsAvailable" --type go -B 2 -A 2 | grep -A 3 -B 3 "pvc_metrics_collector\|PersistentVolumeClaimMetricsCollector"
```

Length of output: 715

🏁 Script executed:

```shell
rg "\.IsAvailable\(" --type go
```

Length of output: 261

🏁 Script executed:

```shell
cat -n internal/controller/collectionpolicy_controller.go | grep -B 5 -A 5 "IsAvailable"
```

Length of output: 960
IsAvailable currently always returns true, preventing availability gating from working.
The method logs error conditions at lines 683 and 692 but still returns true, which contradicts the intent. Callers at line 1494 and 2659 of collectionpolicy_controller.go rely on IsAvailable returning false to skip unavailable collectors. By always returning true, unavailable collectors (when Prometheus API is nil or unresponsive) still get registered, defeating the availability check.
🛠️ Suggested fix

```diff
 	if c.prometheusAPI == nil {
 		c.logger.Info("Prometheus API not available for PVC metrics")
-		return true
+		return false
 	}
@@
 	if err != nil {
 		c.logger.Info("Prometheus API not responding", "error", err.Error())
-		return true
+		return false
 	}
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```go
func (c *PersistentVolumeClaimMetricsCollector) IsAvailable(ctx context.Context) bool {
	if c.prometheusAPI == nil {
		c.logger.Info("Prometheus API not available for PVC metrics")
		return false
	}

	queryCtx, cancel := context.WithTimeout(ctx, 10*time.Second)
	defer cancel()

	_, _, err := c.prometheusAPI.Query(queryCtx, "up", time.Now())
	if err != nil {
		c.logger.Info("Prometheus API not responding", "error", err.Error())
		return false
	}

	return true
}
```
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
internal/controller/collectionpolicy_controller.go (2)

693-695: Missing `pvc_metrics` in affected collectors when `ExcludedPVCs` changes.

The `pvc_metrics` collector constructor takes `ExcludedPVCs` as a parameter (line 2334), but when `ExcludedPVCs` changes, only `"persistent_volume_claim"` is marked as affected. The `pvc_metrics` collector will not be restarted and will continue using stale exclusion rules.

🐛 Proposed fix

```diff
 if !reflect.DeepEqual(oldConfig.ExcludedPVCs, newConfig.ExcludedPVCs) {
 	affectedCollectors["persistent_volume_claim"] = true
+	affectedCollectors["pvc_metrics"] = true
 }
```

831-837: Missing `pvc_metrics` when Prometheus configuration changes.

The `pvc_metrics` collector queries Prometheus (via `PrometheusURL`) and uses `UpdateInterval`, but it's not included in the affected collectors when these values change. This means configuration changes won't take effect until a full restart.

🐛 Proposed fix

```diff
 if oldConfig.UpdateInterval != newConfig.UpdateInterval ||
 	oldConfig.PrometheusURL != newConfig.PrometheusURL ||
 	oldConfig.DisableNetworkIOMetrics != newConfig.DisableNetworkIOMetrics ||
 	oldConfig.DisableGPUMetrics != newConfig.DisableGPUMetrics {
 	affectedCollectors["node"] = true
 	affectedCollectors["container_resource"] = true
+	affectedCollectors["pvc_metrics"] = true
 }
```
🤖 Fix all issues with AI agents
In `@helm-chart/zxporter/local_values.yaml`:
- Around line 31-36: Remove the hard-coded personal access token by clearing or
replacing the patToken value in the Helm values so no real secret is committed;
locate the patToken entry and set it to an empty string or a non-secret
placeholder (e.g., "") and add a comment directing operators to supply the real
PAT via a secret or external secret management at deploy time (reference:
patToken).
* pvc metrics collector added
* endpoint fixed
* make genrate-proto fix every thing
* fix lint error
* local deployment values deleted

---------
pvc metrics collector added
📚 Description of Changes
This PR extends the core metrics-collection functionality by adding a new PVC metrics collector, so zxporter can now send PVC metrics alongside container metrics!
What Changed:
Added a new PVC collector that gathers usage metrics for PersistentVolumeClaims.
Why This Change:
This makes the collector framework extensible, so PVC metrics can be reported alongside the existing resource metrics.