CNTRLPLANE-2539: Move generation of the CAPI Provider Role#8305

Open
cwilkers wants to merge 1 commit into openshift:main from
cwilkers:CNTRLPLANE-2539

Conversation

@cwilkers

@cwilkers cwilkers commented Apr 22, 2026

CNTRLPLANE-2539: Move generation of CAPI Provider Role from cli to operator

What this PR does / why we need it:

Supports work in ZTP deployment of HCP clusters by taking Role generation (which is a security risk in gitops workflows) out of the CLI or manual steps and into the operator. Now the operator will handle the role creation, and gitops based deployments do not require relaxing security restrictions to create Roles.

Which issue(s) this PR fixes:

Fixes CNTRLPLANE-2539

Special notes for your reviewer:

Checklist:

  • [ + ] Subject and description added to both, commit and PR.
  • [ + ] Relevant issues have been referenced.
  • [ ? ] This change includes docs.
  • [ + ] This change includes unit tests.

AI Assistance

  • Model Used: Claude with claude-sonnet-4@20250514 in CLI and in VS Code IDE integration
  • Scope: Planning, Test writing, and Code
  • Level: Code generation and Code assistance
  • Human Review: Quick code review by @cwilkers, manual testing of multiple use cases on local cluster

Summary by CodeRabbit

  • Refactor

    • RBAC role emission removed from the legacy creation path; agent credential lifecycle now centralizes role management and only deletes a shared role when no remaining bindings reference it.
  • Tests

    • Added and extended tests to verify shared role creation, idempotent reconciliation, preservation across multiple hosted clusters, and deletion once unreferenced.

@openshift-merge-bot
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 22, 2026
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Apr 22, 2026
@openshift-ci-robot

openshift-ci-robot commented Apr 22, 2026

@cwilkers: This pull request references CNTRLPLANE-2539 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci
Contributor

openshift-ci Bot commented Apr 22, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@coderabbitai
Contributor

coderabbitai Bot commented Apr 22, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

The PR removes RBAC Role emission from cmd/cluster/agent/create.go and moves Role management into the agent controller. The agent controller now creates/updates an rbacv1.Role named capi-provider-role granting * on agent-install.openshift.io/agents, creates/updates a RoleBinding referencing that Role, and on credential deletion deletes the Role only when no remaining RoleBindings in the agent namespace reference it. Tests were added/updated to cover creation, idempotency, and shared-role deletion semantics.

Sequence Diagram(s)

sequenceDiagram
    participant HostedCluster as HostedCluster
    participant AgentController as Agent Controller
    participant K8sAPI as Kubernetes API
    participant Role as Role (capi-provider-role)
    participant RoleBinding as RoleBinding

    HostedCluster->>AgentController: ReconcileCredentials(request)
    AgentController->>K8sAPI: Get HostedCluster & namespace info
    K8sAPI-->>AgentController: HostedCluster data
    AgentController->>K8sAPI: Create/Update Role (agents on agent-install.openshift.io, verbs: *)
    K8sAPI-->>Role: Role created/updated
    AgentController->>K8sAPI: Create/Update RoleBinding -> RoleRef: capi-provider-role
    K8sAPI-->>RoleBinding: RoleBinding created/updated

    alt Delete credentials flow
        AgentController->>K8sAPI: Delete RoleBinding(s)
        K8sAPI-->>AgentController: RoleBinding deletion results
        AgentController->>K8sAPI: List RoleBindings in agent namespace
        K8sAPI-->>AgentController: RoleBindings list
        alt No remaining RoleBindings referencing capi-provider-role
            AgentController->>K8sAPI: Delete Role (capi-provider-role)
            K8sAPI-->>Role: Role deleted
        else Remaining references exist
            Note right of AgentController: Preserve shared Role
        end
    end
🚥 Pre-merge checks | ✅ 10 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Test Structure And Quality | ⚠️ Warning | Tests lack descriptive assertion messages (93% missing), making failures difficult to diagnose. | Add descriptive failure messages to all assertions throughout the four test functions for improved maintainability and debugging. |
✅ Passed checks (10 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically summarizes the main change: moving CAPI Provider Role generation from CLI to operator, which is reflected in all three modified files. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | Tests use standard Go testing conventions with testing.T, not Ginkgo framework, so Ginkgo naming guidelines do not apply. |
| Microshift Test Compatibility | ✅ Passed | The custom check applies only to new Ginkgo e2e tests using Ginkgo patterns. The tests added are standard Go unit tests following the func TestXxx(t *testing.T) convention, not Ginkgo e2e tests. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | Tests added are standard Go unit tests with fake clients, not Ginkgo e2e tests. No SNO compatibility check required. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | PR changes move RBAC Role generation from CLI to operator; this is authorization logic only, not scheduling constraints. |
| Ote Binary Stdout Contract | ✅ Passed | The pull request contains no stdout writes in process-level code. The three changed files are a CLI platform package, an operator reconciler, and standard Go unit tests with no fmt.Print*, log.Print*, or klog calls. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | Custom check for IPv6 and disconnected network test compatibility applies only to Ginkgo e2e tests. The tests in this PR are standard Go unit tests using testing.T framework with a fake Kubernetes client, have no Ginkgo imports, and do not run in IPv6-only disconnected CI environments, making this check not applicable. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@openshift-ci
Contributor

openshift-ci Bot commented Apr 22, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: cwilkers
Once this PR has been reviewed and has the lgtm label, please assign cblecker for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added area/cli Indicates the PR includes changes for CLI area/hypershift-operator Indicates the PR includes changes for the hypershift operator and API - outside an OCP release and removed do-not-merge/needs-area labels Apr 22, 2026
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@hypershift-operator/controllers/hostedcluster/internal/platform/agent/agent.go`:
- Around line 234-236: The current DeleteIfNeeded call in agent.go removes the
shared Role named by CAPIProviderRoleName in the agent namespace which can break
other HostedClusters; instead, change the cleanup to either (A) skip deleting
the Role entirely, (B) make the Role name unique per controlPlaneNamespace, or
(C) before calling hyperutil.DeleteIfNeeded for the Role (the call that
constructs &rbacv1.Role{... Name: CAPIProviderRoleName, Namespace:
hc.Spec.Platform.Agent.AgentNamespace}), list RoleBindings in
hc.Spec.Platform.Agent.AgentNamespace and only delete the Role if no
RoleBinding.RoleRef (or Subjects) references CAPIProviderRoleName; implement the
RoleBinding check using the client to List rbacv1.RoleBinding objects and
inspect RoleRef.Name/Kind and Subjects to decide safety of deletion.
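Option (C) above can be sketched as follows. This is a minimal illustration under stated assumptions, not the operator's code: `RoleRef` and `RoleBinding` are simplified stand-ins for the `rbacv1` types, and `roleStillReferenced` is a hypothetical helper name. The caller would delete the Role only when the helper returns false.

```go
package main

import "fmt"

// RoleRef and RoleBinding are simplified stand-ins for the rbacv1 types.
// They are assumptions made for this sketch.
type RoleRef struct {
	Kind string
	Name string
}

type RoleBinding struct {
	Name    string
	RoleRef RoleRef
}

// roleStillReferenced (hypothetical helper) reports whether any binding in
// the namespace points at the named Role.
func roleStillReferenced(bindings []RoleBinding, roleName string) bool {
	for _, rb := range bindings {
		if rb.RoleRef.Kind == "Role" && rb.RoleRef.Name == roleName {
			return true
		}
	}
	return false
}

func main() {
	// Another HostedCluster in the same agent namespace still uses the Role.
	remaining := []RoleBinding{
		{Name: "binding-for-hc2", RoleRef: RoleRef{Kind: "Role", Name: "capi-provider-role"}},
	}
	fmt.Println("keep role:", roleStillReferenced(remaining, "capi-provider-role"))
	fmt.Println("keep role:", roleStillReferenced(nil, "capi-provider-role"))
}
```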

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: 9f07eb43-14f4-4111-b02b-4c2718081b20

📥 Commits

Reviewing files that changed from the base of the PR and between f2fd2ca and 051ef9f.

⛔ Files ignored due to path filters (1)
  • cmd/cluster/agent/testdata/zz_fixture_TestCreateCluster_minimal_flags_necessary_to_render.yaml is excluded by !**/testdata/**
📒 Files selected for processing (3)
  • cmd/cluster/agent/create.go
  • hypershift-operator/controllers/hostedcluster/internal/platform/agent/agent.go
  • hypershift-operator/controllers/hostedcluster/internal/platform/agent/agent_test.go

Comment thread hypershift-operator/controllers/hostedcluster/internal/platform/agent/agent.go Outdated
@codecov

codecov Bot commented Apr 22, 2026

Codecov Report

❌ Patch coverage is 60.41667% with 19 lines in your changes missing coverage. Please review.
✅ Project coverage is 37.53%. Comparing base (2fc8a13) to head (edbb80e).
⚠️ Report is 24 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...ers/hostedcluster/internal/platform/agent/agent.go | 64.28% | 9 Missing and 6 partials ⚠️ |
| ...trollers/hostedcluster/hostedcluster_controller.go | 20.00% | 2 Missing and 2 partials ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8305      +/-   ##
==========================================
+ Coverage   37.50%   37.53%   +0.02%     
==========================================
  Files         751      751              
  Lines       91992    92049      +57     
==========================================
+ Hits        34505    34549      +44     
- Misses      54844    54853       +9     
- Partials     2643     2647       +4     
| Files with missing lines | Coverage Δ |
| --- | --- |
| cmd/cluster/agent/create.go | 49.38% <100.00%> (-9.21%) ⬇️ |
| ...trollers/hostedcluster/hostedcluster_controller.go | 43.21% <20.00%> (-0.02%) ⬇️ |
| ...ers/hostedcluster/internal/platform/agent/agent.go | 42.48% <64.28%> (+4.42%) ⬆️ |

... and 4 files with indirect coverage changes

| Flag | Coverage Δ |
| --- | --- |
| cmd-support | 32.72% <100.00%> (+0.03%) ⬆️ |
| cpo-hostedcontrolplane | 36.77% <ø> (ø) |
| cpo-other | 37.76% <ø> (+0.03%) ⬆️ |
| hypershift-operator | 47.95% <59.57%> (+0.01%) ⬆️ |
| other | 27.77% <ø> (ø) |

Flags with carried forward coverage won't be shown. Click here to find out more.


@cwilkers cwilkers force-pushed the CNTRLPLANE-2539 branch 2 times, most recently from 668e301 to aa074db Compare April 22, 2026 13:11
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@hypershift-operator/controllers/hostedcluster/internal/platform/agent/agent.go`:
- Around line 231-245: The Role may be left orphaned because the cached List can
still return the just-deleted RoleBinding; update the loop that scans
roleBindings.Items (after DeleteIfNeeded) to ignore entries that are being
deleted or are the exact binding we just removed: skip items with non-nil
DeletionTimestamp and skip items whose ObjectMeta.Name equals
fmt.Sprintf("%s-%s", CredentialsRBACPrefix, controlPlaneNamespace); also tighten
the match to require roleBindings.Items[i].RoleRef.APIGroup ==
"rbac.authorization.k8s.io" in addition to RoleRef.Kind == "Role" and
RoleRef.Name == CAPIProviderRoleName so you only return early for real, active
RoleBindings. Ensure these checks are applied where roleBindings is iterated
before deciding not to delete the Role.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: ca68a83e-fe57-41cf-983f-f50ed82118a8

📥 Commits

Reviewing files that changed from the base of the PR and between 051ef9f and 668e301.

⛔ Files ignored due to path filters (1)
  • cmd/cluster/agent/testdata/zz_fixture_TestCreateCluster_minimal_flags_necessary_to_render.yaml is excluded by !**/testdata/**
📒 Files selected for processing (3)
  • cmd/cluster/agent/create.go
  • hypershift-operator/controllers/hostedcluster/internal/platform/agent/agent.go
  • hypershift-operator/controllers/hostedcluster/internal/platform/agent/agent_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • cmd/cluster/agent/create.go
  • hypershift-operator/controllers/hostedcluster/internal/platform/agent/agent_test.go

Comment thread hypershift-operator/controllers/hostedcluster/internal/platform/agent/agent.go Outdated
Contributor

@coderabbitai coderabbitai Bot left a comment


♻️ Duplicate comments (1)
hypershift-operator/controllers/hostedcluster/internal/platform/agent/agent.go (1)

234-245: ⚠️ Potential issue | 🟡 Minor

Skip stale/deleting RoleBindings before preserving the shared Role.

Line 239 can still match the RoleBinding just deleted on Line 231 when the cached client has not observed the deletion yet, causing the final cleanup to return early and leave capi-provider-role orphaned. Also include RoleRef.APIGroup so only active RBAC RoleRefs preserve the Role.

Suggested tightening of the cleanup guard
 	for i := range roleBindings.Items {
-		if roleBindings.Items[i].RoleRef.Kind == "Role" && roleBindings.Items[i].RoleRef.Name == CAPIProviderRoleName {
+		roleBinding := &roleBindings.Items[i]
+		if roleBinding.Name == fmt.Sprintf("%s-%s", CredentialsRBACPrefix, controlPlaneNamespace) || roleBinding.DeletionTimestamp != nil {
+			continue
+		}
+		if roleBinding.RoleRef.APIGroup == "rbac.authorization.k8s.io" &&
+			roleBinding.RoleRef.Kind == "Role" &&
+			roleBinding.RoleRef.Name == CAPIProviderRoleName {
 			return nil
 		}
 	}
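The tightened guard in the diff above can be exercised in isolation. The sketch below is illustrative only: the types are simplified stand-ins for `rbacv1.RoleBinding` (the `Deleting` field stands in for a non-nil `DeletionTimestamp`), and the binding names `creds-hc1` and `creds-hc2` are made up for the example.

```go
package main

import "fmt"

// rbacAPIGroup is the API group used by RBAC Roles and RoleBindings.
const rbacAPIGroup = "rbac.authorization.k8s.io"

// RoleRef and RoleBinding are simplified stand-ins for the rbacv1 types.
type RoleRef struct {
	APIGroup string
	Kind     string
	Name     string
}

type RoleBinding struct {
	Name     string
	Deleting bool // stand-in for DeletionTimestamp != nil
	RoleRef  RoleRef
}

// hasLiveReference skips the binding that was just deleted and any binding
// already marked for deletion, then matches on the full RoleRef.
func hasLiveReference(bindings []RoleBinding, justDeleted, roleName string) bool {
	for i := range bindings {
		rb := &bindings[i]
		if rb.Name == justDeleted || rb.Deleting {
			continue
		}
		if rb.RoleRef.APIGroup == rbacAPIGroup &&
			rb.RoleRef.Kind == "Role" &&
			rb.RoleRef.Name == roleName {
			return true
		}
	}
	return false
}

func main() {
	ref := RoleRef{APIGroup: rbacAPIGroup, Kind: "Role", Name: "capi-provider-role"}
	// A stale cached list: it still contains the binding we just deleted and
	// one that is already terminating.
	staleList := []RoleBinding{
		{Name: "creds-hc1", RoleRef: ref},
		{Name: "creds-hc2", Deleting: true, RoleRef: ref},
	}
	fmt.Println("live reference:", hasLiveReference(staleList, "creds-hc1", "capi-provider-role"))
}
```

With the stale list in `main`, neither entry counts as a live reference, so the Role would not be left orphaned.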

Run this read-only check to confirm whether this path is using a cached manager client in production:

#!/bin/bash
# Description: Inspect controller wiring to see whether DeleteCredentials receives a cached controller-runtime client.
# Expect: If mgr.GetClient() or reconciler Client is passed through, cached List behavior should be assumed.

rg -n -C4 'DeleteCredentials\s*\(|ReconcileCredentials\s*\(|mgr\.GetClient\(\)|GetAPIReader\(\)|client\.New\(' .
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@hypershift-operator/controllers/hostedcluster/internal/platform/agent/agent.go`
around lines 234 - 245, The loop that preserves the shared Role can erroneously
match RoleBindings that are being deleted or from other API groups; update the
check over roleBindings.Items in agent.go to skip items with a non-nil
metadata.DeletionTimestamp and require RoleRef.APIGroup ==
"rbac.authorization.k8s.io" in addition to RoleRef.Kind == "Role" and
RoleRef.Name == CAPIProviderRoleName (so the guard uses RoleBindingList /
roleBindings.Items[i].ObjectMeta.DeletionTimestamp and RoleRef.APIGroup checks);
leave the hyperutil.DeleteIfNeeded call for the Role (constructed with
CAPIProviderRoleName and hc.Spec.Platform.Agent.AgentNamespace) unchanged so the
role is only preserved by active, same-APIGroup RoleBindings.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: 63d4fe37-2e1d-4837-be44-6aaf3ff1d451

📥 Commits

Reviewing files that changed from the base of the PR and between 668e301 and aa074db.

⛔ Files ignored due to path filters (1)
  • cmd/cluster/agent/testdata/zz_fixture_TestCreateCluster_minimal_flags_necessary_to_render.yaml is excluded by !**/testdata/**
📒 Files selected for processing (3)
  • cmd/cluster/agent/create.go
  • hypershift-operator/controllers/hostedcluster/internal/platform/agent/agent.go
  • hypershift-operator/controllers/hostedcluster/internal/platform/agent/agent_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • cmd/cluster/agent/create.go
  • hypershift-operator/controllers/hostedcluster/internal/platform/agent/agent_test.go

Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
hypershift-operator/controllers/hostedcluster/internal/platform/agent/agent.go (1)

228-254: Deletion guard looks good; consider logging the skip for observability.

The updated DeleteCredentials correctly addresses prior concerns: it skips the just-deleted RoleBinding by name, ignores bindings with a non-nil DeletionTimestamp, and tightens the match with RoleRef.APIGroup == "rbac.authorization.k8s.io" before short-circuiting. This correctly preserves the shared capi-provider-role when another HostedCluster in the same agent namespace still references it.

Two small follow-ups to consider (non-blocking):

  1. When skipping role deletion (Line 248), a debug/info log indicating "role retained — still referenced by RoleBinding X" would help operators diagnose why a Role lingers in the agent namespace after an HC teardown.
  2. The early return nil on Line 248 exits on the first referencing binding, which is correct, but since the iteration order of roleBindings.Items is not deterministic you won't be able to tell which binding held it. Capturing the name in a log as suggested above mitigates this.
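One way to read the two follow-ups above: have the guard return the name of the binding that retains the Role, so the caller can log it before returning. The sketch below assumes simplified types and uses the standard `log` package in place of the controller logger; `retainingBinding` and the `RefersToRole` field are hypothetical names introduced for illustration.

```go
package main

import "log"

// RoleBinding is a simplified stand-in for rbacv1.RoleBinding.
type RoleBinding struct {
	Name         string
	Deleting     bool // stand-in for DeletionTimestamp != nil
	RefersToRole bool // stand-in for the full RoleRef match
}

// retainingBinding (hypothetical helper) returns the name of the first live
// binding that still references the shared Role, if there is one.
func retainingBinding(bindings []RoleBinding, justDeleted string) (string, bool) {
	for _, rb := range bindings {
		if rb.Name == justDeleted || rb.Deleting {
			continue
		}
		if rb.RefersToRole {
			return rb.Name, true
		}
	}
	return "", false
}

func main() {
	bindings := []RoleBinding{
		{Name: "creds-hc1", RefersToRole: true},
		{Name: "creds-hc2", RefersToRole: true},
	}
	if name, ok := retainingBinding(bindings, "creds-hc1"); ok {
		// The real code would use the controller's logger here.
		log.Printf("retaining CAPI provider role; referenced by RoleBinding %s", name)
		return
	}
	log.Print("no live references; the role can be deleted")
}
```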
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@hypershift-operator/controllers/hostedcluster/internal/platform/agent/agent.go`
around lines 228 - 254, Add an observability log before early-returning from
DeleteCredentials to show which RoleBinding prevented role deletion: inside the
loop over roleBindings.Items in DeleteCredentials, when you detect a live
referencing binding (rb.RoleRef.APIGroup == "rbac.authorization.k8s.io" &&
rb.RoleRef.Kind == "Role" && rb.RoleRef.Name == CAPIProviderRoleName) call the
controller logger (e.g., ctrl.LoggerFrom(ctx) or the project logger in context)
to emit an Info/Debug message like "retaining CAPI provider role; referenced by
RoleBinding <rb.Name>" and then return nil; also consider logging when you skip
the just-deleted binding (deletedBindingName) or when skipping due to
rb.DeletionTimestamp != nil for extra observability.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: e1264a9a-e9bb-4549-9298-afdeed2950c2

📥 Commits

Reviewing files that changed from the base of the PR and between aa074db and ceef355.

⛔ Files ignored due to path filters (1)
  • cmd/cluster/agent/testdata/zz_fixture_TestCreateCluster_minimal_flags_necessary_to_render.yaml is excluded by !**/testdata/**
📒 Files selected for processing (3)
  • cmd/cluster/agent/create.go
  • hypershift-operator/controllers/hostedcluster/internal/platform/agent/agent.go
  • hypershift-operator/controllers/hostedcluster/internal/platform/agent/agent_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • hypershift-operator/controllers/hostedcluster/internal/platform/agent/agent_test.go

@cwilkers cwilkers marked this pull request as ready for review April 22, 2026 15:58
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 22, 2026
@openshift-ci openshift-ci Bot requested review from clebs and jparrill April 22, 2026 15:59
Comment thread hypershift-operator/controllers/hostedcluster/internal/platform/agent/agent.go Outdated
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
hypershift-operator/controllers/hostedcluster/internal/platform/agent/agent_test.go (1)

73-106: Solid idempotency coverage; optionally assert rule content, not just length.

The test correctly verifies the second ReconcileCredentials call does not produce an error or duplicate rules. As a small hardening, consider also asserting the rule content on the second read (APIGroups/Resources/Verbs) so a regression that replaces the rule with a different-but-still-length-1 rule would be caught.

♻️ Optional strengthening of idempotency assertion
 	g.Expect(err).ToNot(HaveOccurred())
 	g.Expect(role.Rules).To(HaveLen(1))
+	g.Expect(role.Rules[0].APIGroups).To(Equal([]string{"agent-install.openshift.io"}))
+	g.Expect(role.Rules[0].Resources).To(Equal([]string{"agents"}))
+	g.Expect(role.Rules[0].Verbs).To(Equal([]string{"*"}))
 }
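To see why a length-only assertion is weaker than a content assertion, consider the standalone sketch below. `PolicyRule` is a simplified stand-in for `rbacv1.PolicyRule`, and `expectedRules` and `rulesMatch` are hypothetical helpers; the rule values are the ones named in the walkthrough.

```go
package main

import (
	"fmt"
	"reflect"
)

// PolicyRule is a simplified stand-in for rbacv1.PolicyRule.
type PolicyRule struct {
	APIGroups []string
	Resources []string
	Verbs     []string
}

// expectedRules returns the single rule described in the walkthrough.
func expectedRules() []PolicyRule {
	return []PolicyRule{{
		APIGroups: []string{"agent-install.openshift.io"},
		Resources: []string{"agents"},
		Verbs:     []string{"*"},
	}}
}

// rulesMatch compares rule content, not just the number of rules.
func rulesMatch(got, want []PolicyRule) bool {
	return reflect.DeepEqual(got, want)
}

func main() {
	// A regression that swaps in a different rule of the same length.
	drifted := []PolicyRule{{
		APIGroups: []string{"agent-install.openshift.io"},
		Resources: []string{"agents"},
		Verbs:     []string{"get"},
	}}
	fmt.Println("same length:", len(drifted) == len(expectedRules()))
	fmt.Println("same content:", rulesMatch(drifted, expectedRules()))
}
```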
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@hypershift-operator/controllers/hostedcluster/internal/platform/agent/agent_test.go`
around lines 73 - 106, The test
TestReconcileCredentials_WhenCalledMultipleTimes_ItShouldBeIdempotent currently
only asserts role.Rules length; after fetching the Role (variable role from
client.Get using CAPIProviderRoleName) add assertions that the single rule's
fields match the expected APIGroups, Resources and Verbs (e.g. check
role.Rules[0].APIGroups, role.Rules[0].Resources and role.Rules[0].Verbs equal
the canonical slices expected by ReconcileCredentials) so a
replacement-with-different-rule regression is caught; keep these assertions
after the second ReconcileCredentials call to validate idempotent content as
well as count.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@hypershift-operator/controllers/hostedcluster/internal/platform/agent/agent.go`:
- Around line 228-254: The DeleteCredentials function has a race where a
RoleBinding can be created after the cached List completes but before we
DeleteIfNeeded the Role; to fix, before calling hyperutil.DeleteIfNeeded for the
Role (in DeleteCredentials), perform either a second uncached server-side List
(use c.List with client.InNamespace and client.DirectClient or use
client.Options with metav1.ListOptions) to re-check for any live RoleBindings
referencing CAPIProviderRoleName (skip deletedBindingName and DeletionTimestamp
as already done), or add and use a deterministic label on Role/RoleBinding in
ReconcileCredentials and then List using that label selector server-side so only
relevant bindings are considered; if the second check finds any matching
RoleBindings return nil, otherwise proceed to delete via
hyperutil.DeleteIfNeeded.
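The second-check idea can be sketched with two list functions, one standing in for the cached client and one for a direct read of the API server. Everything here is an assumption for illustration: the `lister` type, `safeToDeleteRole`, and the simplified `RoleBinding` are not part of controller-runtime or the operator. A re-check like this narrows the window between listing and deleting but does not close it.

```go
package main

import "fmt"

// RoleBinding is a simplified stand-in for rbacv1.RoleBinding.
type RoleBinding struct {
	Name       string
	Deleting   bool // stand-in for DeletionTimestamp != nil
	RefersRole bool // stand-in for the full RoleRef match
}

// lister abstracts "list RoleBindings in the agent namespace".
type lister func() []RoleBinding

func liveReference(bindings []RoleBinding, justDeleted string) bool {
	for _, rb := range bindings {
		if rb.Name == justDeleted || rb.Deleting {
			continue
		}
		if rb.RefersRole {
			return true
		}
	}
	return false
}

// safeToDeleteRole consults the cached view first and, only if that view
// shows no references, re-checks against the direct (uncached) view.
func safeToDeleteRole(cached, direct lister, justDeleted string) bool {
	if liveReference(cached(), justDeleted) {
		return false
	}
	return !liveReference(direct(), justDeleted)
}

func main() {
	// The cache has not yet observed a binding created for another cluster.
	cached := func() []RoleBinding { return nil }
	direct := func() []RoleBinding {
		return []RoleBinding{{Name: "creds-hc3", RefersRole: true}}
	}
	fmt.Println("safe to delete:", safeToDeleteRole(cached, direct, "creds-hc1"))
}
```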


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Pro Plus

Run ID: 8f0b5551-1ad6-45ef-b804-379415b3ac99

📥 Commits

Reviewing files that changed from the base of the PR and between ceef355 and 01425d8.

⛔ Files ignored due to path filters (1)
  • cmd/cluster/agent/testdata/zz_fixture_TestCreateCluster_minimal_flags_necessary_to_render.yaml is excluded by !**/testdata/**
📒 Files selected for processing (3)
  • cmd/cluster/agent/create.go
  • hypershift-operator/controllers/hostedcluster/internal/platform/agent/agent.go
  • hypershift-operator/controllers/hostedcluster/internal/platform/agent/agent_test.go

@jparrill
Contributor

jparrill commented May 4, 2026

Thanks for working on this @cwilkers — I understand the motivation for ZTP/gitops workflows. I have a few concerns about the current approach that I'd like to discuss before we settle on the right solution:

  1. Role reconciliation on every loop iteration

Moving the Role creation into ReconcileCredentials() means every reconcile of every Agent-platform HostedCluster will issue at minimum a GET against the API server for a Role whose content is static and never changes. Before this PR, the Role was created once by the CLI and left alone. The self-healing benefit is real, but the cost of reconciling a static resource on every iteration seems disproportionate, especially at scale with many Agent-platform HCs.

Would it make sense to either (a) check for existence before calling createOrUpdate, or (b) move the Role creation to a one-time setup in the operator startup (e.g., alongside other platform bootstrapping)?
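
Option (a) could look roughly like the sketch below. It is written against a hypothetical `roleStore` interface rather than the real client, and `errNotFound` stands in for the condition that `apierrors.IsNotFound` reports; none of these names come from the PR.

```go
package main

import "errors"

// errNotFound stands in for the condition apierrors.IsNotFound reports.
var errNotFound = errors.New("role not found")

// roleStore is a hypothetical, minimal view of the API client.
type roleStore interface {
	Get(namespace, name string) error
	Create(namespace, name string) error
}

// ensureRole creates the Role only when it is missing. It reports whether
// a write was issued, so callers can see that steady-state reconciles only
// read.
func ensureRole(s roleStore, namespace, name string) (bool, error) {
	err := s.Get(namespace, name)
	if err == nil {
		return false, nil // already present: nothing to write
	}
	if !errors.Is(err, errNotFound) {
		return false, err // unexpected read error: surface it
	}
	if err := s.Create(namespace, name); err != nil {
		return false, err
	}
	return true, nil
}

// memStore is an in-memory roleStore that counts writes.
type memStore struct {
	roles  map[string]bool
	writes int
}

func (m *memStore) Get(namespace, name string) error {
	if m.roles[namespace+"/"+name] {
		return nil
	}
	return errNotFound
}

func (m *memStore) Create(namespace, name string) error {
	m.roles[namespace+"/"+name] = true
	m.writes++
	return nil
}
```

This still costs one read per reconcile, but no write after the first. It also gives up repairing a Role whose rules were edited by hand, which is the trade-off against the self-healing behaviour mentioned above.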

  2. Race condition in shared agent namespace teardown

When two HostedClusters share the same agentNamespace and are deleted concurrently, the following race can occur:

  1. HC1 and HC2 both enter DeleteCredentials
  2. Both delete their respective RoleBindings
  3. Both list remaining RoleBindings — each sees zero (or only the other's with a DeletionTimestamp)
  4. Both decide to delete the shared capi-provider-role
  5. While HC1's CAPI provider is still running its Cluster finalizer cleanup (which needs access to agents resources), HC2's DeleteCredentials deletes the Role

The double-delete itself is safe (DeleteIfNeeded is idempotent), but the problem is the Role disappearing while a CAPI provider still needs it to release agents during teardown. The deletion order in delete() (CAPI Cluster deleted → wait for completion → DeleteCredentials) protects the single-HC case, but not the concurrent multi-HC case.
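
To make the check in step 3 concrete, here is a sketch of the "is the Role still referenced?" decision as a pure function. `roleBinding` is a stand-in for `rbacv1.RoleBinding` and `roleStillReferenced` is a hypothetical helper, not the code in this PR. It only shows how the treatment of bindings that already carry a DeletionTimestamp changes the outcome; it does not by itself close the race described above.

```go
package main

// roleBinding is a stand-in for rbacv1.RoleBinding, keeping only the
// fields the decision needs.
type roleBinding struct {
	Name        string
	RoleRefKind string // mirrors RoleRef.Kind
	RoleRefName string // mirrors RoleRef.Name
	Terminating bool   // true when DeletionTimestamp is set
}

// roleStillReferenced reports whether any binding points at roleName.
// With ignoreTerminating set, bindings that are already being deleted are
// skipped; with it unset, they still count as references, which keeps the
// Role around until those bindings are fully gone.
func roleStillReferenced(bindings []roleBinding, roleName string, ignoreTerminating bool) bool {
	for _, b := range bindings {
		if b.RoleRefKind != "Role" || b.RoleRefName != roleName {
			continue // binding points somewhere else
		}
		if ignoreTerminating && b.Terminating {
			continue // caller chose to treat this binding as gone
		}
		return true
	}
	return false
}
```

With one terminating binding left in the namespace, `ignoreTerminating=true` returns false (the Role would be deleted) while `ignoreTerminating=false` returns true (the Role is kept for now).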

  3. Some minor items:
  • The List in DeleteCredentials fetches all RoleBindings in the agent namespace (a user-controlled namespace that could have many unrelated bindings). Consider adding a HyperShift label to managed RoleBindings during reconciliation and filtering by that label during deletion.
  • The Role grants Verbs: []string{"*"} on agents — I know this is unchanged from the CLI version, but since we're moving it to the operator this would be a good opportunity to scope it down to the minimum required set (get, list, watch, update, patch).
  • Consider using rbacv1.GroupName instead of the hardcoded "rbac.authorization.k8s.io" string.
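
A rough sketch of the three minor items together, using stand-in types. The label key `managedLabel` is hypothetical (the PR does not define one), `binding` is a reduced stand-in for `rbacv1.RoleBinding`, and `rbacGroupName` only mirrors the value of `rbacv1.GroupName` so the block compiles on its own.

```go
package main

// rbacGroupName mirrors the value of rbacv1.GroupName in k8s.io/api/rbac/v1;
// real code would reference that constant instead of a string literal.
const rbacGroupName = "rbac.authorization.k8s.io"

// managedLabel is a hypothetical label key used only for this illustration.
const managedLabel = "hypershift.openshift.io/capi-provider-binding"

// binding is a stand-in for rbacv1.RoleBinding with just name and labels.
type binding struct {
	Name   string
	Labels map[string]string
}

// filterManaged keeps only the bindings labelled managedLabel=value, so
// unrelated bindings in a user-controlled namespace are never considered.
func filterManaged(all []binding, value string) []binding {
	var out []binding
	for _, b := range all {
		if b.Labels[managedLabel] == value {
			out = append(out, b)
		}
	}
	return out
}

// scopedAgentVerbs returns the narrower verb set proposed in the review in
// place of "*".
func scopedAgentVerbs() []string {
	return []string{"get", "list", "watch", "update", "patch"}
}
```

With controller-runtime the filter would normally be pushed to the server by passing `client.MatchingLabels` to the List call rather than filtering in memory; the in-memory version here only shows the selection rule.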

I see your point on the ZTP use case, but concerns 1 and 2 would need to be addressed before this can go forward. Let me know your thoughts.

@cwilkers
Author

cwilkers commented May 4, 2026

With some planning help by Claude, I can move the Role creation out of the ReconcileCredentials function where it doesn't make that much sense, and into the reconcileCAPIProvider function where it can be gated to only create if the capi deployment does not exist yet. I'll need to test.

For the deletion race condition, would it be an option to simply not have the operator delete the Role? It's not as clean, but a NS deletion will clear it and it doesn't really grant access to anything we would worry about. The other option suggested by AI would be to add multiple finalizers to the Role, which seems clunky to me.

@cwilkers cwilkers force-pushed the CNTRLPLANE-2539 branch from 01425d8 to c2e85ea Compare May 6, 2026 14:41
@openshift-ci openshift-ci Bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 6, 2026
Supports work in ZTP deployment of HCP clusters by taking Role
generation (which is a security risk in gitops workflows) out of the CLI
and into the operator. Now the operator will handle the role creation,
and gitops based deployments need not be unconstrained to be able to
create Roles.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: cwilkers <cwilkers@redhat.com>
@cwilkers cwilkers force-pushed the CNTRLPLANE-2539 branch from c2e85ea to edbb80e Compare May 8, 2026 11:27
@openshift-ci openshift-ci Bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 8, 2026
@clebs
Member

clebs commented May 8, 2026

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label May 8, 2026
@openshift-merge-bot
Contributor

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aks-4-22
/test e2e-aws-4-22
/test e2e-aks
/test e2e-aws
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws

@cwilkers
Author

cwilkers commented May 8, 2026

/test e2e-aws

@hypershift-jira-solve-ci

hypershift-jira-solve-ci Bot commented May 8, 2026

The PR changes are explicitly gated behind hcluster.Spec.Platform.Type == hyperv1.AgentPlatform — these e2e tests run on AWS platform. The failures are completely unrelated to the PR. Now I have everything I need for the report.

Test Failure Analysis Complete

Job Information

  • Prow Job: pull-ci-openshift-hypershift-main-e2e-aws-4-22
  • Build ID: 2052717127420350464
  • Target: e2e-aws-4-22
  • PR: #8305 CNTRLPLANE-2539: Move generation of the CAPI Provider Role
  • Result: 517 tests, 28 skipped, 5 failures (2 distinct root causes)

Test Failure Analysis

Error

1) TestKarpenter/Main/AutoNode_enable/disable_lifecycle (306.07s)
   Failed to wait for HostedCluster to have AutoNodeEnabled=True/AsExpected in 5m0s: context deadline exceeded
   Got: AutoNodeEnabled=False: AutoNodeProgressing(AutoNode is being enabled: karpenter: Waiting for deployment karpenter rollout to finish: 1 out of 1 new replicas have been updated)

2) TestCreateCluster/Teardown (1356.19s)
   Failed to wait for infra resources in guest cluster to be deleted: context deadline exceeded
   Failed to clean up 4 remaining resources for guest cluster (1 NLB, 3 EBS volumes)

Summary

Both failures are unrelated to the PR changes and are pre-existing flaky test issues. PR #8305 modifies CAPI Provider Role generation exclusively for the Agent platform (all new code is gated behind hcluster.Spec.Platform.Type == hyperv1.AgentPlatform), while these e2e tests run on the AWS platform. The PR's code paths are never exercised by these tests. The two distinct failures are: (1) a Karpenter deployment rollout timeout during the AutoNode enable/disable lifecycle test, where re-enabling Karpenter after disabling it gets stuck in AutoNodeProgressing state because the karpenter deployment never completes its rollout within the 5-minute timeout; and (2) an AWS resource cleanup timeout during TestCreateCluster teardown, where 4 cloud resources (1 NLB load balancer and 3 EBS volumes) failed to be deleted within the deadline. All functional test phases (TestCreateCluster/Main, TestCreateCluster/EnsureHostedCluster, TestKarpenter/Main/Parallel_provisioning_tests, etc.) passed successfully.

Root Cause

Failure 1: TestKarpenter/Main/AutoNode_enable/disable_lifecycle

The test disables AutoNode (Karpenter) on the HostedCluster, verifies it reaches AutoNodeEnabled=False/AutoNodeNotConfigured (succeeds in 3s), then re-enables it. After re-enabling, the test waits for AutoNodeEnabled=True/AsExpected but the condition remains stuck at AutoNodeEnabled=False: AutoNodeProgressing(AutoNode is being enabled: karpenter: Waiting for deployment karpenter rollout to finish: 1 out of 1 new replicas have been updated) for the entire 5-minute timeout. The karpenter deployment reported "1 out of 1 new replicas have been updated" but never transitioned to available — this is a classic Kubernetes deployment rollout stall where the new pod is created but doesn't reach Ready state within the progress deadline. All other HostedCluster conditions are healthy (API server, etcd, infrastructure all available), isolating the issue to the karpenter deployment itself. This is a known intermittent issue with the Karpenter enable/disable lifecycle test.

Failure 2: TestCreateCluster/Teardown

The TestCreateCluster/Main phase and all EnsureHostedCluster validation checks passed successfully. The failure occurred only during teardown when attempting to destroy the hosted cluster and clean up AWS infrastructure. The HostedCluster finalizer didn't complete within the timeout ("hostedcluster wasn't finalized, aborting delete: context deadline exceeded"), leaving 4 orphaned AWS resources: 1 NLB (router-default) and 3 EBS volumes (one per AZ for the HA node machines). This is a teardown-only flake caused by slow AWS resource deletion, not a functional test failure.

Neither failure is caused by PR #8305. The PR modifies:

  • cmd/cluster/agent/create.go — Removes client-side CAPI provider Role creation for Agent platform
  • hypershift-operator/controllers/hostedcluster/hostedcluster_controller.go — Adds server-side Role reconciliation, gated behind hcluster.Spec.Platform.Type == hyperv1.AgentPlatform
  • hypershift-operator/controllers/hostedcluster/internal/platform/agent/agent.go — New ReconcileCAPIProviderRole function and enhanced DeleteCredentials

All changes are Agent-platform-specific code paths that are never executed during AWS e2e tests.

Recommendations
  1. Re-trigger the job — These are known intermittent failures unrelated to the PR. Running /test e2e-aws-4-22 should pass on retry.
  2. No code changes needed in PR #8305 (CNTRLPLANE-2539: Move generation of the CAPI Provider Role): the PR's Agent platform CAPI Provider Role changes are completely isolated from the AWS test failures.
  3. For the Karpenter flake — The AutoNode enable/disable lifecycle test's 5-minute timeout may be insufficient for the karpenter deployment rollout to complete after a disable/re-enable cycle. Consider increasing the timeout or investigating why the karpenter pod fails to reach Ready after re-enablement.
  4. For the TestCreateCluster/Teardown flake — The AWS resource cleanup timeout during cluster destruction is a known infrastructure flake. The functional test itself (Main + EnsureHostedCluster) passed fully.
Evidence
| Evidence | Detail |
|---|---|
| PR scope | Changes only Agent platform CAPI Provider Role; all new code gated behind hcluster.Spec.Platform.Type == hyperv1.AgentPlatform |
| Test platform | e2e tests run on AWS platform, never exercising Agent platform code paths |
| TestCreateCluster/Main | ✅ PASSED (1073.56s) |
| TestCreateCluster/EnsureHostedCluster | ✅ PASSED (3.03s) |
| TestCreateCluster/Teardown | ❌ FAILED: context deadline exceeded during AWS resource cleanup (1 NLB + 3 EBS volumes) |
| TestKarpenter parallel tests | ✅ All PASSED (subnet propagation, capacity reservation, EC2NodeClass, instance profile) |
| TestKarpenter/Main/AutoNode_enable/disable | ❌ FAILED: karpenter deployment rollout stuck at "1 out of 1 new replicas have been updated" for 5m |
| Karpenter HC conditions | All healthy except AutoNodeEnabled=False: AutoNodeProgressing; cluster itself is fully functional |
| TestUpgradeControlPlane | ✅ PASSED (3291.00s) |
| TestNodePool | ✅ PASSED |
| Overall results | 517 tests, 28 skipped, 5 failures (all 5 from just 2 root causes, both teardown/lifecycle flakes) |
| Files changed by PR | agent/create.go, hostedcluster_controller.go, platform/agent/agent.go, agent_test.go, test fixtures |

@cwilkers
Author

cwilkers commented May 8, 2026

/test e2e-aws-4-22

@openshift-ci
Contributor

openshift-ci Bot commented May 8, 2026

@cwilkers: all tests passed!



Labels

  • area/cli: Indicates the PR includes changes for CLI
  • area/hypershift-operator: Indicates the PR includes changes for the hypershift operator and API - outside an OCP release
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • lgtm: Indicates that a PR is ready to be merged.


4 participants