CNTRLPLANE-2539: Move generation of the CAPI Provider Role#8305
CNTRLPLANE-2539: Move generation of the CAPI Provider Role#8305cwilkers wants to merge 1 commit intoopenshift:mainfrom
Conversation
|
Pipeline controller notification For optional jobs, comment This repository is configured in: LGTM mode |
|
@cwilkers: This pull request references CNTRLPLANE-2539 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Skipping CI for Draft Pull Request. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughThe PR removes RBAC Role emission from cmd/cluster/agent/create.go and moves Role management into the agent controller. The agent controller now creates/updates an rbacv1.Role named capi-provider-role granting * on agent-install.openshift.io/agents, creates/updates a RoleBinding referencing that Role, and on credential deletion deletes the Role only when no remaining RoleBindings in the agent namespace reference it. Tests were added/updated to cover creation, idempotency, and shared-role deletion semantics. Sequence Diagram(s)sequenceDiagram
participant HostedCluster as HostedCluster
participant AgentController as Agent Controller
participant K8sAPI as Kubernetes API
participant Role as Role (capi-provider-role)
participant RoleBinding as RoleBinding
HostedCluster->>AgentController: ReconcileCredentials(request)
AgentController->>K8sAPI: Get HostedCluster & namespace info
K8sAPI-->>AgentController: HostedCluster data
AgentController->>K8sAPI: Create/Update Role (agents on agent-install.openshift.io, verbs: *)
K8sAPI-->>Role: Role created/updated
AgentController->>K8sAPI: Create/Update RoleBinding -> RoleRef: capi-provider-role
K8sAPI-->>RoleBinding: RoleBinding created/updated
alt Delete credentials flow
AgentController->>K8sAPI: Delete RoleBinding(s)
K8sAPI-->>AgentController: RoleBinding deletion results
AgentController->>K8sAPI: List RoleBindings in agent namespace
K8sAPI-->>AgentController: RoleBindings list
alt No remaining RoleBindings referencing capi-provider-role
AgentController->>K8sAPI: Delete Role (capi-provider-role)
K8sAPI-->>Role: Role deleted
else Remaining references exist
Note right of AgentController: Preserve shared Role
end
end
🚥 Pre-merge checks | ✅ 10 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (10 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: cwilkers The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@hypershift-operator/controllers/hostedcluster/internal/platform/agent/agent.go`:
- Around line 234-236: The current DeleteIfNeeded call in agent.go removes the
shared Role named by CAPIProviderRoleName in the agent namespace which can break
other HostedClusters; instead, change the cleanup to either (A) skip deleting
the Role entirely, (B) make the Role name unique per controlPlaneNamespace, or
(C) before calling hyperutil.DeleteIfNeeded for the Role (the call that
constructs &rbacv1.Role{... Name: CAPIProviderRoleName, Namespace:
hc.Spec.Platform.Agent.AgentNamespace}), list RoleBindings in
hc.Spec.Platform.Agent.AgentNamespace and only delete the Role if no
RoleBinding.RoleRef (or Subjects) references CAPIProviderRoleName; implement the
RoleBinding check using the client to List rbacv1.RoleBinding objects and
inspect RoleRef.Name/Kind and Subjects to decide safety of deletion.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Pro Plus
Run ID: 9f07eb43-14f4-4111-b02b-4c2718081b20
⛔ Files ignored due to path filters (1)
cmd/cluster/agent/testdata/zz_fixture_TestCreateCluster_minimal_flags_necessary_to_render.yamlis excluded by!**/testdata/**
📒 Files selected for processing (3)
cmd/cluster/agent/create.gohypershift-operator/controllers/hostedcluster/internal/platform/agent/agent.gohypershift-operator/controllers/hostedcluster/internal/platform/agent/agent_test.go
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #8305 +/- ##
==========================================
+ Coverage 37.50% 37.53% +0.02%
==========================================
Files 751 751
Lines 91992 92049 +57
==========================================
+ Hits 34505 34549 +44
- Misses 54844 54853 +9
- Partials 2643 2647 +4
... and 4 files with indirect coverage changes
Flags with carried forward coverage won't be shown. Click here to find out more. 🚀 New features to boost your workflow:
|
668e301 to
aa074db
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@hypershift-operator/controllers/hostedcluster/internal/platform/agent/agent.go`:
- Around line 231-245: The Role may be left orphaned because the cached List can
still return the just-deleted RoleBinding; update the loop that scans
roleBindings.Items (after DeleteIfNeeded) to ignore entries that are being
deleted or are the exact binding we just removed: skip items with non-nil
DeletionTimestamp and skip items whose ObjectMeta.Name equals
fmt.Sprintf("%s-%s", CredentialsRBACPrefix, controlPlaneNamespace); also tighten
the match to require roleBindings.Items[i].RoleRef.APIGroup ==
"rbac.authorization.k8s.io" in addition to RoleRef.Kind == "Role" and
RoleRef.Name == CAPIProviderRoleName so you only return early for real, active
RoleBindings. Ensure these checks are applied where roleBindings is iterated
before deciding not to delete the Role.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Pro Plus
Run ID: ca68a83e-fe57-41cf-983f-f50ed82118a8
⛔ Files ignored due to path filters (1)
cmd/cluster/agent/testdata/zz_fixture_TestCreateCluster_minimal_flags_necessary_to_render.yamlis excluded by!**/testdata/**
📒 Files selected for processing (3)
cmd/cluster/agent/create.gohypershift-operator/controllers/hostedcluster/internal/platform/agent/agent.gohypershift-operator/controllers/hostedcluster/internal/platform/agent/agent_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
- cmd/cluster/agent/create.go
- hypershift-operator/controllers/hostedcluster/internal/platform/agent/agent_test.go
There was a problem hiding this comment.
♻️ Duplicate comments (1)
hypershift-operator/controllers/hostedcluster/internal/platform/agent/agent.go (1)
234-245:⚠️ Potential issue | 🟡 MinorSkip stale/deleting RoleBindings before preserving the shared Role.
Line 239 can still match the RoleBinding just deleted on Line 231 when the cached client has not observed the deletion yet, causing the final cleanup to return early and leave
capi-provider-roleorphaned. Also includeRoleRef.APIGroupso only active RBAC RoleRefs preserve the Role.Suggested tightening of the cleanup guard
for i := range roleBindings.Items { - if roleBindings.Items[i].RoleRef.Kind == "Role" && roleBindings.Items[i].RoleRef.Name == CAPIProviderRoleName { + roleBinding := &roleBindings.Items[i] + if roleBinding.Name == fmt.Sprintf("%s-%s", CredentialsRBACPrefix, controlPlaneNamespace) || roleBinding.DeletionTimestamp != nil { + continue + } + if roleBinding.RoleRef.APIGroup == "rbac.authorization.k8s.io" && + roleBinding.RoleRef.Kind == "Role" && + roleBinding.RoleRef.Name == CAPIProviderRoleName { return nil } }Run this read-only check to confirm whether this path is using a cached manager client in production:
#!/bin/bash # Description: Inspect controller wiring to see whether DeleteCredentials receives a cached controller-runtime client. # Expect: If mgr.GetClient() or reconciler Client is passed through, cached List behavior should be assumed. rg -n -C4 'DeleteCredentials\s*\(|ReconcileCredentials\s*\(|mgr\.GetClient\(\)|GetAPIReader\(\)|client\.New\(' .🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hypershift-operator/controllers/hostedcluster/internal/platform/agent/agent.go` around lines 234 - 245, The loop that preserves the shared Role can erroneously match RoleBindings that are being deleted or from other API groups; update the check over roleBindings.Items in agent.go to skip items with a non-nil metadata.DeletionTimestamp and require RoleRef.APIGroup == "rbac.authorization.k8s.io" in addition to RoleRef.Kind == "Role" and RoleRef.Name == CAPIProviderRoleName (so the guard uses RoleBindingList / roleBindings.Items[i].ObjectMeta.DeletionTimestamp and RoleRef.APIGroup checks); leave the hyperutil.DeleteIfNeeded call for the Role (constructed with CAPIProviderRoleName and hc.Spec.Platform.Agent.AgentNamespace) unchanged so the role is only preserved by active, same-APIGroup RoleBindings.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In
`@hypershift-operator/controllers/hostedcluster/internal/platform/agent/agent.go`:
- Around line 234-245: The loop that preserves the shared Role can erroneously
match RoleBindings that are being deleted or from other API groups; update the
check over roleBindings.Items in agent.go to skip items with a non-nil
metadata.DeletionTimestamp and require RoleRef.APIGroup ==
"rbac.authorization.k8s.io" in addition to RoleRef.Kind == "Role" and
RoleRef.Name == CAPIProviderRoleName (so the guard uses RoleBindingList /
roleBindings.Items[i].ObjectMeta.DeletionTimestamp and RoleRef.APIGroup checks);
leave the hyperutil.DeleteIfNeeded call for the Role (constructed with
CAPIProviderRoleName and hc.Spec.Platform.Agent.AgentNamespace) unchanged so the
role is only preserved by active, same-APIGroup RoleBindings.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Pro Plus
Run ID: 63d4fe37-2e1d-4837-be44-6aaf3ff1d451
⛔ Files ignored due to path filters (1)
cmd/cluster/agent/testdata/zz_fixture_TestCreateCluster_minimal_flags_necessary_to_render.yamlis excluded by!**/testdata/**
📒 Files selected for processing (3)
cmd/cluster/agent/create.gohypershift-operator/controllers/hostedcluster/internal/platform/agent/agent.gohypershift-operator/controllers/hostedcluster/internal/platform/agent/agent_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
- cmd/cluster/agent/create.go
- hypershift-operator/controllers/hostedcluster/internal/platform/agent/agent_test.go
aa074db to
ceef355
Compare
There was a problem hiding this comment.
🧹 Nitpick comments (1)
hypershift-operator/controllers/hostedcluster/internal/platform/agent/agent.go (1)
228-254: Deletion guard looks good; consider logging the skip for observability.The updated
DeleteCredentialscorrectly addresses prior concerns: it skips the just-deleted RoleBinding by name, ignores bindings with a non-nilDeletionTimestamp, and tightens the match withRoleRef.APIGroup == "rbac.authorization.k8s.io"before short-circuiting. This correctly preserves the sharedcapi-provider-rolewhen another HostedCluster in the same agent namespace still references it.Two small follow-ups to consider (non-blocking):
- When skipping role deletion (Line 248), a debug/info log indicating "role retained — still referenced by RoleBinding X" would help operators diagnose why a Role lingers in the agent namespace after an HC teardown.
- The early
return nilon Line 248 exits on the first referencing binding, which is correct, but since the iteration order ofroleBindings.Itemsis not deterministic you won't be able to tell which binding held it. Capturing the name in a log as suggested above mitigates this.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hypershift-operator/controllers/hostedcluster/internal/platform/agent/agent.go` around lines 228 - 254, Add an observability log before early-returning from DeleteCredentials to show which RoleBinding prevented role deletion: inside the loop over roleBindings.Items in DeleteCredentials, when you detect a live referencing binding (rb.RoleRef.APIGroup == "rbac.authorization.k8s.io" && rb.RoleRef.Kind == "Role" && rb.RoleRef.Name == CAPIProviderRoleName) call the controller logger (e.g., ctrl.LoggerFrom(ctx) or the project logger in context) to emit an Info/Debug message like "retaining CAPI provider role; referenced by RoleBinding <rb.Name>" and then return nil; also consider logging when you skip the just-deleted binding (deletedBindingName) or when skipping due to rb.DeletionTimestamp != nil for extra observability.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In
`@hypershift-operator/controllers/hostedcluster/internal/platform/agent/agent.go`:
- Around line 228-254: Add an observability log before early-returning from
DeleteCredentials to show which RoleBinding prevented role deletion: inside the
loop over roleBindings.Items in DeleteCredentials, when you detect a live
referencing binding (rb.RoleRef.APIGroup == "rbac.authorization.k8s.io" &&
rb.RoleRef.Kind == "Role" && rb.RoleRef.Name == CAPIProviderRoleName) call the
controller logger (e.g., ctrl.LoggerFrom(ctx) or the project logger in context)
to emit an Info/Debug message like "retaining CAPI provider role; referenced by
RoleBinding <rb.Name>" and then return nil; also consider logging when you skip
the just-deleted binding (deletedBindingName) or when skipping due to
rb.DeletionTimestamp != nil for extra observability.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Pro Plus
Run ID: e1264a9a-e9bb-4549-9298-afdeed2950c2
⛔ Files ignored due to path filters (1)
cmd/cluster/agent/testdata/zz_fixture_TestCreateCluster_minimal_flags_necessary_to_render.yamlis excluded by!**/testdata/**
📒 Files selected for processing (3)
cmd/cluster/agent/create.gohypershift-operator/controllers/hostedcluster/internal/platform/agent/agent.gohypershift-operator/controllers/hostedcluster/internal/platform/agent/agent_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- hypershift-operator/controllers/hostedcluster/internal/platform/agent/agent_test.go
ceef355 to
01425d8
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
hypershift-operator/controllers/hostedcluster/internal/platform/agent/agent_test.go (1)
73-106: Solid idempotency coverage; optionally assert rule content, not just length.The test correctly verifies the second
ReconcileCredentialscall does not produce an error or duplicate rules. As a small hardening, consider also asserting the rule content on the second read (APIGroups/Resources/Verbs) so a regression that replaces the rule with a different-but-still-length-1 rule would be caught.♻️ Optional strengthening of idempotency assertion
g.Expect(err).ToNot(HaveOccurred()) g.Expect(role.Rules).To(HaveLen(1)) + g.Expect(role.Rules[0].APIGroups).To(Equal([]string{"agent-install.openshift.io"})) + g.Expect(role.Rules[0].Resources).To(Equal([]string{"agents"})) + g.Expect(role.Rules[0].Verbs).To(Equal([]string{"*"})) }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@hypershift-operator/controllers/hostedcluster/internal/platform/agent/agent_test.go` around lines 73 - 106, The test TestReconcileCredentials_WhenCalledMultipleTimes_ItShouldBeIdempotent currently only asserts role.Rules length; after fetching the Role (variable role from client.Get using CAPIProviderRoleName) add assertions that the single rule's fields match the expected APIGroups, Resources and Verbs (e.g. check role.Rules[0].APIGroups, role.Rules[0].Resources and role.Rules[0].Verbs equal the canonical slices expected by ReconcileCredentials) so a replacement-with-different-rule regression is caught; keep these assertions after the second ReconcileCredentials call to validate idempotent content as well as count.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@hypershift-operator/controllers/hostedcluster/internal/platform/agent/agent.go`:
- Around line 228-254: The DeleteCredentials function has a race where a
RoleBinding can be created after the cached List completes but before we
DeleteIfNeeded the Role; to fix, before calling hyperutil.DeleteIfNeeded for the
Role (in DeleteCredentials), perform either a second uncached server-side List
(use c.List with client.InNamespace and client.DirectClient or use
client.Options with metav1.ListOptions) to re-check for any live RoleBindings
referencing CAPIProviderRoleName (skip deletedBindingName and DeletionTimestamp
as already done), or add and use a deterministic label on Role/RoleBinding in
ReconcileCredentials and then List using that label selector server-side so only
relevant bindings are considered; if the second check finds any matching
RoleBindings return nil, otherwise proceed to delete via
hyperutil.DeleteIfNeeded.
---
Nitpick comments:
In
`@hypershift-operator/controllers/hostedcluster/internal/platform/agent/agent_test.go`:
- Around line 73-106: The test
TestReconcileCredentials_WhenCalledMultipleTimes_ItShouldBeIdempotent currently
only asserts role.Rules length; after fetching the Role (variable role from
client.Get using CAPIProviderRoleName) add assertions that the single rule's
fields match the expected APIGroups, Resources and Verbs (e.g. check
role.Rules[0].APIGroups, role.Rules[0].Resources and role.Rules[0].Verbs equal
the canonical slices expected by ReconcileCredentials) so a
replacement-with-different-rule regression is caught; keep these assertions
after the second ReconcileCredentials call to validate idempotent content as
well as count.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Pro Plus
Run ID: 8f0b5551-1ad6-45ef-b804-379415b3ac99
⛔ Files ignored due to path filters (1)
cmd/cluster/agent/testdata/zz_fixture_TestCreateCluster_minimal_flags_necessary_to_render.yamlis excluded by!**/testdata/**
📒 Files selected for processing (3)
cmd/cluster/agent/create.gohypershift-operator/controllers/hostedcluster/internal/platform/agent/agent.gohypershift-operator/controllers/hostedcluster/internal/platform/agent/agent_test.go
|
Thanks for working on this @cwilkers — I understand the motivation for ZTP/gitops workflows. I have a few concerns about the current approach that I'd like to discuss before we settle on the right solution:
Moving the Role creation into Would it make sense to either (a) check for existence before calling createOrUpdate, or (b) move the Role creation to a one-time setup in the operator startup (e.g., alongside other platformbootstrapping)?
When two HostedClusters share the same agentNamespace and are deleted concurrently, the following race can occur:
The double-delete itself is safe (DeleteIfNeeded is idempotent), but the problem is the Role disappearing while a CAPI provider still needs it to release agents during teardown. The deletion order
I see your point on the ZTP use case but concerns 1 and 2 would need to be addressed before this can go forward. Let me know your toughts. |
|
With some planning help by Claude, I can move the Role creation out of the ReconcileCredentials function where it doesn't make that much sense, and into the reconcileCAPIProvider function where it can be gated to only create if the capi deployment does not exist yet. I'll need to test. For the deletion race condition, would it be an option to simply not have the operator delete the Role? It's not as clean, but a NS deletion will clear it and it doesn't really grant access to anything we would worry about. The other option suggested by AI would be to add multiple finalizers to the Role, which seems clunky to me. |
Supports work in ZTP deployment of HCP clusters by taking Role generation (which is a security risk in gitops workflows) out of the CLI and into the operator. Now the operator will handle the role creation, and gitops based deployments need not be unconstrained to be able to create Roles. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: cwilkers <cwilkers@redhat.com>
|
/lgtm |
|
Scheduling tests matching the |
|
/test e2e-aws |
|
The PR changes are explicitly gated behind Test Failure Analysis CompleteJob Information
Test Failure AnalysisErrorSummaryBoth failures are unrelated to the PR changes and are pre-existing flaky test issues. PR #8305 modifies CAPI Provider Role generation exclusively for the Agent platform (all new code is gated behind Root CauseFailure 1: TestKarpenter/Main/AutoNode_enable/disable_lifecycle The test disables AutoNode (Karpenter) on the HostedCluster, verifies it reaches Failure 2: TestCreateCluster/Teardown The TestCreateCluster/Main phase and all EnsureHostedCluster validation checks passed successfully. The failure occurred only during teardown when attempting to destroy the hosted cluster and clean up AWS infrastructure. The HostedCluster finalizer didn't complete within the timeout ("hostedcluster wasn't finalized, aborting delete: context deadline exceeded"), leaving 4 orphaned AWS resources: 1 NLB (router-default) and 3 EBS volumes (one per AZ for the HA node machines). This is a teardown-only flake caused by slow AWS resource deletion, not a functional test failure. Neither failure is caused by PR #8305. The PR modifies:
All changes are Agent-platform-specific code paths that are never executed during AWS e2e tests. Recommendations
Evidence
|
|
/test e2e-aws-4-22 |
|
@cwilkers: all tests passed! Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
CNTRLPLANE-2539: Move generation of CAPI Provider Role from cli to operator
What this PR does / why we need it:
Supports work in ZTP deployment of HCP clusters by taking Role generation (which is a security risk in gitops workflows) out of the CLI or manual steps and into the operator. Now the operator will handle the role creation, and gitops based deployments do not require relaxing security restrictions to create Roles.
Which issue(s) this PR fixes:
Fixes CNTRLPLANE-2539
Special notes for your reviewer:
Checklist:
AI Assistance
Summary by CodeRabbit
Refactor
Tests