[zxp] adding monitoring for more karpenter types#184
Conversation
WalkthroughThis PR adds Karpenter resource support: RBAC rules expanded for aksnodeclasses, gcenodeclasses, ocinodeclasses, and nodeoverlays; RBAC markers updated; and the internal collector registers informers and maps the new kinds/versions. Changes
Sequence Diagram(s)sequenceDiagram
participant Collector as Collector
participant Informer as Dynamic Informer Factory
participant K8sAPI as Kubernetes API
Note over Collector,Informer: Register new Karpenter kinds: aksnodeclasses, gcenodeclasses, ocinodeclasses, nodeoverlays
Collector->>Informer: Register(group/version/resource)
Informer->>K8sAPI: Create/watch informer for resource
K8sAPI-->>Informer: Resource events (ADDED/UPDATED/DELETED)
Informer-->>Collector: Deliver events
Note over Collector: Existing processing pipeline handles events
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes
Possibly related PRs
Suggested reviewers
Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
📜 Recent review detailsConfiguration used: CodeRabbit UI Review profile: CHILL Plan: Pro 📒 Files selected for processing (1)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
🔇 Additional comments (7)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 golangci-lint (2.5.0)Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (2)
internal/collector/karpenter_collector.go (2)
144-150: LGTM! OciNodeClass resource correctly defined.The OciNodeClass definition is correct with proper group, version, and resource values. Documentation links to the Zoom karpenter-oci project are helpful.
Similar to other new node classes, this will use generic processing which may not capture OCI-specific configurations like shape definitions, boot volume settings, or compartment IDs.
368-384: Consider adding provider-specific processing functions for new node classes.Currently, the new node class resources (AKSNodeClass, GCENodeClass, OciNodeClass, NodeOverlay) default to
processGenericResourcewhich only extracts common metadata fields. While this is acceptable for initial support, you could enhance observability by adding provider-specific processing functions similar toprocessEC2NodeClass(line 545) orprocessAWSNodeTemplate(line 517).Provider-specific functions could extract:
- AKSNodeClass: VM sizes, OS disk configuration, subnet/network configuration
- GCENodeClass: Machine types, disk configurations, network tags, service accounts
- OciNodeClass: Shape configurations, image IDs, boot volumes, compartment settings
- NodeOverlay: Overlay specifications and taints
This would provide richer monitoring data for these cloud providers.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (4)
dist/backend-install.yamlis excluded by!**/dist/**dist/install.yamlis excluded by!**/dist/**dist/installer_updater.yamlis excluded by!**/dist/**dist/zxporter.yamlis excluded by!**/dist/**
📒 Files selected for processing (3)
config/rbac/role.yaml(2 hunks)internal/collector/karpenter_collector.go(2 hunks)internal/controller/collectionpolicy_controller.go(1 hunks)
🔇 Additional comments (9)
internal/controller/collectionpolicy_controller.go (1)
212-217: LGTM! RBAC markers are correctly added for new Karpenter resources.The kubebuilder RBAC markers for nodeoverlays, aksnodeclasses, ocinodeclasses, and gcenodeclasses are correctly formatted and consistent with the existing patterns. The
get;list;watchverbs are appropriate for read-only monitoring of these resources.internal/collector/karpenter_collector.go (4)
122-129: LGTM! AKSNodeClass v1alpha2 resource correctly defined.The resource definition is correct with proper group, version, resource, and kind values. The included documentation links are helpful for reference.
Note: This resource will use generic processing via
processGenericResource(line 383), which may not capture Azure-specific fields like subnet selectors, VM size constraints, or OS disk configuration. This is acceptable for initial support.
137-143: LGTM! NodeOverlay resource correctly added.The NodeOverlay resource definition is correct with helpful documentation links. This resource will default to generic processing, which should be sufficient for capturing the basic overlay configuration.
151-157: LGTM! GCENodeClass resource correctly added.The GCENodeClass definition is correct with appropriate documentation links to the cloudpilot-ai/karpenter-provider-gcp project.
This resource will use generic processing, which may not extract GCP-specific fields such as machine type configurations, disk settings, or service account configurations.
175-181: LGTM! AKSNodeClass v1beta1 resource correctly added.Supporting both v1alpha2 (line 124) and v1beta1 versions of AKSNodeClass follows the established pattern seen with AWS resources, ensuring compatibility across different Karpenter provider versions.
config/rbac/role.yaml (4)
131-138: LGTM! RBAC rules for AKSNodeClass correctly added.The ClusterRole rules for
karpenter.azure.com/aksnodeclassesare correctly structured with appropriate read-only verbs. This authorizes monitoring of both v1alpha2 and v1beta1 AKSNodeClass resources defined in the collector.
148-155: LGTM! RBAC rules for GCENodeClass correctly added.The ClusterRole rules for
karpenter.k8s.gcp/gcenodeclassesare correctly defined, matching the controller RBAC marker and collector resource definition.
156-163: LGTM! RBAC rules for OciNodeClass correctly added.The ClusterRole rules for
karpenter.k8s.oracle/ocinodeclassesare properly structured and consistent with the controller and collector definitions.
164-175: LGTM! NodeOverlay resource correctly added to karpenter.sh rules.The
nodeoverlaysresource is correctly added to the existingkarpenter.shClusterRole rule, maintaining alphabetical ordering among the resources.
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
internal/collector/karpenter_collector.go (1)
122-181: Critical handler gaps: Three v1beta1 resources not handled in determineKarpenterResourceTypeThe resources slice registers 12 resource types but
determineKarpenterResourceTypeis missing handlers for:
- NodePool v1beta1 (only has case for v1)
- NodeClaim v1beta1 (only has cases for v1alpha5 and v1)
- AKSNodeClass v1beta1 (only has case for v1alpha2)
If these resources are encountered during collection, the function will hit the default case and return an error, failing resource collection.
🧹 Nitpick comments (1)
internal/collector/karpenter_collector.go (1)
368-384: Consider adding specific processors for new resource types.The newly added resource types (AKSNodeClass, NodeOverlay, OciNodeClass, GCENodeClass) will use the default generic processing. While this works for basic monitoring, these provider-specific resources may have important spec fields that could benefit from custom extraction—similar to how
processAWSNodeTemplateandprocessEC2NodeClassextract provider-specific details like instance types, subnet selectors, etc.For example, AKSNodeClass likely has Azure-specific fields (VM size, network configuration), and GCENodeClass would have GCP-specific fields (machine types, service accounts).
Consider adding dedicated processor methods for better observability:
processAKSNodeClassfor Azure-specific fieldsprocessNodeOverlayfor overlay configurationprocessOciNodeClassfor Oracle-specific fieldsprocessGCENodeClassfor GCP-specific fieldsThis can be deferred if generic processing meets current requirements.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
internal/collector/karpenter_collector.go(4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Check if build-installer produces changes
- GitHub Check: Run make test
- GitHub Check: Build Docker Image
- GitHub Check: Analyze (go)
a7dc91d to
50ae2fc
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
internal/collector/karpenter_collector.go (1)
868-875: Missing AKSNodeClass v1beta1 mapping in determineKarpenterResourceType.Add the v1beta1 case to match the informer list; otherwise AddResource and any code path relying on detection will error for v1beta1 AKSNodeClass. This was flagged earlier and is still applicable.
Apply:
// azure specific case kind == "AKSNodeClass" && strings.Contains(apiVersion, "karpenter.azure.com/v1alpha2"): return KarpenterResource{ GroupVersion: schema.GroupVersion{Group: "karpenter.azure.com", Version: "v1alpha2"}, Resource: "aksnodeclasses", Kind: "AKSNodeClass", }, nil + case kind == "AKSNodeClass" && strings.Contains(apiVersion, "karpenter.azure.com/v1beta1"): + return KarpenterResource{ + GroupVersion: schema.GroupVersion{Group: "karpenter.azure.com", Version: "v1beta1"}, + Resource: "aksnodeclasses", + Kind: "AKSNodeClass", + }, nil#!/bin/bash rg -n 'karpenter\.azure\.com/v1beta1' -n internal/collector/karpenter_collector.go || echo "Mapping for AKS v1beta1 not found in switch"
🧹 Nitpick comments (2)
internal/collector/karpenter_collector.go (2)
137-157: New NodeOverlay, OCI, and GCE watchers — LGTM, with optional processing later.Generic processing is fine for now. Consider provider-specific processors later if you need richer fields.
807-894: Prefer exact Group/Version parsing over substring checks.Using strings.Contains on apiVersion risks false positives. Parse once and compare exact group/version for these cases.
Example:
gv, err := schema.ParseGroupVersion(apiVersion) if err != nil { return KarpenterResource{}, fmt.Errorf("invalid apiVersion: %w", err) } if kind == "AKSNodeClass" && gv.Group == "karpenter.azure.com" && gv.Version == "v1beta1" { ... }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (4)
dist/backend-install.yamlis excluded by!**/dist/**dist/install.yamlis excluded by!**/dist/**dist/installer_updater.yamlis excluded by!**/dist/**dist/zxporter.yamlis excluded by!**/dist/**
📒 Files selected for processing (4)
config/rbac/role.yaml(2 hunks)helm-chart/zxporter/templates/zxporter-rbac.yaml(2 hunks)internal/collector/karpenter_collector.go(4 hunks)internal/controller/collectionpolicy_controller.go(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- internal/controller/collectionpolicy_controller.go
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Build Docker Image
- GitHub Check: Build Docker Image
- GitHub Check: Run make test
- GitHub Check: Analyze (go)
🔇 Additional comments (4)
helm-chart/zxporter/templates/zxporter-rbac.yaml (1)
251-258: Helm RBAC mirrors kustomize RBAC — good.Provider NodeClass groups and NodeOverlay added with read-only verbs. Keep these sections synchronized with config/rbac/role.yaml to prevent permission mismatches at deploy time.
Reuse the parity script from the other comment to verify both manifests are consistent in CI.
Also applies to: 269-276, 277-283, 289-289
internal/collector/karpenter_collector.go (2)
121-130: AKS v1alpha2 watcher added — LGTM.Informer will no-op if CRD absent; safe addition.
175-181: Watching AKS v1beta1 — ensure type mapping exists.The Start list includes AKSNodeClass v1beta1, but the type switch lacks the corresponding mapping (see below). Add it to keep AddResource and any direct processing consistent.
config/rbac/role.yaml (1)
131-138: RBAC parity confirmed across both files.Verification passed:
config/rbac/role.yamlandhelm-chart/zxporter/templates/zxporter-rbac.yamlcontain identical Karpenter RBAC rules for all five API groups (azure.com, k8s.aws, k8s.gcp, k8s.oracle, karpenter.sh) with matching resources and verbs. No drift detected.
There was a problem hiding this comment.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
internal/collector/karpenter_collector.go (1)
165-169: Critical: Missing v1beta1 NodeClaim mapping.The v1beta1 NodeClaim informer is created here, but there's no corresponding case in
determineKarpenterResourceType. This will cause theAddResourcemethod to fail when processing v1beta1 NodeClaim resources.Apply this diff to add the missing v1beta1 case:
case kind == "NodeClaim" && strings.Contains(apiVersion, "karpenter.sh/v1alpha5"): return KarpenterResource{ GroupVersion: schema.GroupVersion{Group: "karpenter.sh", Version: "v1alpha5"}, Resource: "nodeclaims", Kind: "NodeClaim", }, nil case kind == "NodeClaim" && strings.Contains(apiVersion, "karpenter.sh/v1"): return KarpenterResource{ GroupVersion: schema.GroupVersion{Group: "karpenter.sh", Version: "v1"}, Resource: "nodeclaims", Kind: "NodeClaim", }, nil + case kind == "NodeClaim" && strings.Contains(apiVersion, "karpenter.sh/v1beta1"): + return KarpenterResource{ + GroupVersion: schema.GroupVersion{Group: "karpenter.sh", Version: "v1beta1"}, + Resource: "nodeclaims", + Kind: "NodeClaim", + }, nil
♻️ Duplicate comments (1)
internal/collector/karpenter_collector.go (1)
175-181: Critical: Missing v1beta1 AKSNodeClass mapping.Despite the previous review comment indicating this was addressed, the v1beta1 AKSNodeClass case is still missing from
determineKarpenterResourceType. Only the v1alpha2 version is mapped (lines 875-880). This will cause theAddResourcemethod to fail when processing v1beta1 AKSNodeClass resources.Apply this diff to add the missing v1beta1 case:
// azure specific case kind == "AKSNodeClass" && strings.Contains(apiVersion, "karpenter.azure.com/v1alpha2"): return KarpenterResource{ GroupVersion: schema.GroupVersion{Group: "karpenter.azure.com", Version: "v1alpha2"}, Resource: "aksnodeclasses", Kind: "AKSNodeClass", }, nil + case kind == "AKSNodeClass" && strings.Contains(apiVersion, "karpenter.azure.com/v1beta1"): + return KarpenterResource{ + GroupVersion: schema.GroupVersion{Group: "karpenter.azure.com", Version: "v1beta1"}, + Resource: "aksnodeclasses", + Kind: "AKSNodeClass", + }, nil
🧹 Nitpick comments (1)
internal/collector/karpenter_collector.go (1)
368-384: Consider adding dedicated processing functions for new resource types.The new resources (AKSNodeClass, OciNodeClass, GCENodeClass, NodeOverlay) will fall through to
processGenericResource, which only extracts common metadata. For richer monitoring, you may want to add dedicated processing functions (similar toprocessEC2NodeClassandprocessAWSNodeTemplate) to extract provider-specific fields from specs and status.For example:
case "AKSNodeClass": processedObj = c.processAKSNodeClass(obj) case "OciNodeClass": processedObj = c.processOciNodeClass(obj) case "GCENodeClass": processedObj = c.processGCENodeClass(obj) case "NodeOverlay": processedObj = c.processNodeOverlay(obj)This can be implemented incrementally as monitoring requirements evolve.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
internal/collector/karpenter_collector.go(4 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Run make test
- GitHub Check: Analyze (go)
🔇 Additional comments (3)
internal/collector/karpenter_collector.go (3)
122-129: LGTM! AKSNodeClass v1alpha2 correctly configured.The resource definition is complete and has a corresponding mapping in
determineKarpenterResourceType(lines 875-880). The GitHub reference comments are helpful.
137-157: LGTM! New cloud provider resources correctly configured.All three new resources (NodeOverlay, OciNodeClass, GCENodeClass) are properly defined with:
- Correct GroupVersion, Resource, and Kind fields
- Corresponding mappings in
determineKarpenterResourceType(lines 835-840, 883-888, 891-896)- Helpful GitHub reference comments
835-896: LGTM! New resource type mappings are well-structured.The additions to
determineKarpenterResourceTypeare properly organized by cloud provider and correctly map the new resource types that have complete definitions. The comment-based organization ("default types", "aws specific", "azure specific", etc.) improves readability.Note: The missing v1beta1 cases for NodeClaim and AKSNodeClass are flagged in separate comments.
* [zxp] adding monitoring for more karpenter types * added nodeoverlays to collectionpolicy_controller.go and ran `make build-installer` * added aksnodeclasses to collectionpolicy_controller.go and ran `make build-installer` * added ocinodeclasses to collectionpolicy_controller.go and ran `make build-installer` * added gcenodeclasses to collectionpolicy_controller.go and ran `make build-installer` * update helm chart `zxporter-rbac.yaml` based on contents of `config/rbac/role.yaml` * add logic for the snapshotter workflow * fix to snapshot workflow * add spec and status to the generic karpenter resource parser
Summary by CodeRabbit
New Features
Chores