
DNM/SPLAT-2455: Add e2e tests for BYO SG feature on AWS NLBs#452

Draft
mfbonfigli wants to merge 2 commits into openshift:main from mfbonfigli:SPLAT-2455_e2e-tests-byo-sg-on-nlb

Conversation

@mfbonfigli

@mfbonfigli mfbonfigli commented Apr 22, 2026

Do Not Merge / WIP

This PR adds e2e tests for the BYO Security Group feature. It depends on the feature first being merged in upstream kubernetes/cloud-provider-aws and on the IAM Set Security Groups permission being granted to the CCM. For now, the PR is just a way to execute the e2e tests using the "testwith" command.

Unlike PR #451, this one does not base its changes on PR #447, which is still a draft with known issues that prevent the tests from executing correctly.

Summary by CodeRabbit

  • Tests
    • Expanded E2E test coverage for AWS Network Load Balancers with custom security group management via service annotations.
    • Added comprehensive tests validating bring-your-own security group creation, preservation during service lifecycle, and round-trip switching between managed and custom security groups.
    • Introduced testing helpers for security group operations and cluster resource discovery.

Adds end-to-end tests for BYO security groups on AWS NLBs. These
tests are defined downstream due to limitations in the upstream
test setup. They validate the basic flows of provisioning an NLB
with a BYO security group and switching back and forth between
managed and BYO security groups.
Uses the AWS SDK logic to detect API operations that fail due to
retryable errors, such as throttling or network issues.
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Apr 22, 2026
@openshift-ci-robot

openshift-ci-robot commented Apr 22, 2026

@mfbonfigli: This pull request references SPLAT-2455 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.

Details

In response to this:

Do Not Merge / WIP

This PR adds e2e tests for the BYO Security Group feature. It depends on the feature first being merged in upstream kubernetes/cloud-provider-aws and on the IAM Set Security Groups permission being granted to the CCM. For now, the PR is just a way to execute the e2e tests using the "testwith" command.

Unlike PR #451, this one does not base its changes on PR #447, which is still a draft with known issues that prevent the tests from executing correctly.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci
Contributor

openshift-ci Bot commented Apr 22, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 22, 2026
@coderabbitai

coderabbitai Bot commented Apr 22, 2026

Walkthrough

These changes introduce E2E test infrastructure for AWS Network Load Balancer security group management, including helpers for EC2 security group lifecycle operations, cluster discovery utilities, and comprehensive test cases for bring-your-own (BYO) security group workflows and managed-to-BYO transitions.

Changes

Cohort / File(s) Summary
EC2 Security Group & Cluster Discovery Helpers
cmd/cloud-controller-manager-aws-tests-ext/e2e/helper.go
Added AWS SDK-based retry classification (isAWSRetryableError), EC2 security group lifecycle helpers (createAWSSecurityGroup, authorizeSecurityGroupIngress, deleteAWSSecurityGroup), management detection (isSecurityGroupManaged), polling-based deletion with retry logic (waitForSecurityGroupDeletion), and cluster discovery utilities (getClusterInstanceID, getClusterVPCID, getClusterName).
NLB BYO Security Group Test Coverage
cmd/cloud-controller-manager-aws-tests-ext/e2e/loadbalancer.go
Added new annotation constant annotationLBSecurityGroups and two E2E test scenarios: (1) BYO SG creation with preservation across service deletion, and (2) managed-to-BYO-to-managed security group round-trip transitions with NLB attachment verification.
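The BYO workflow above hinges on a single Service annotation. The manifest below is an illustrative sketch only: the annotation key shown is the existing cloud-provider-aws security-groups annotation, the annotationLBSecurityGroups constant added in this PR may reference a different key, and the SG ID is a placeholder.

```yaml
# Hypothetical Service exercising the BYO SG path on an NLB.
apiVersion: v1
kind: Service
metadata:
  name: e2e-nlb-byo-sg
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
    # Placeholder key and SG ID; the real key is whatever
    # annotationLBSecurityGroups resolves to in loadbalancer.go.
    service.beta.kubernetes.io/aws-load-balancer-security-groups: "sg-0123456789abcdef0"
spec:
  type: LoadBalancer
  selector:
    app: echo
  ports:
    - port: 80
      targetPort: 8080
```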

Sequence Diagram(s)

sequenceDiagram
    participant Test
    participant K8s as Kubernetes Service
    participant Ctrl as Controller
    participant AWS as AWS (NLB/SG)
    
    rect rgba(200, 150, 100, 0.5)
        Note over Test,AWS: Test 1: BYO SG Creation & Preservation
        Test->>AWS: Create security group in cluster VPC
        Test->>AWS: Authorize ingress rules
        Test->>K8s: Create LoadBalancer Service with BYO SG annotation
        Ctrl->>AWS: Attach BYO SG to NLB
        Test->>AWS: Verify BYO SG attached, not controller-managed
        Test->>K8s: Delete Service
        Ctrl->>AWS: Delete NLB
        Test->>AWS: Verify BYO SG still exists, unmanaged
    end
    
    rect rgba(100, 150, 200, 0.5)
        Note over Test,AWS: Test 2: Managed ↔ BYO ↔ Managed Round-trip
        Test->>K8s: Create LoadBalancer Service (no BYO annotation)
        Ctrl->>AWS: Create managed SGs, attach to NLB
        Test->>AWS: Verify managed SGs attached
        Test->>AWS: Create BYO SG with ingress rules
        Test->>K8s: Update Service with BYO SG annotation
        Ctrl->>AWS: Attach BYO SG to NLB
        Ctrl->>AWS: Delete previous managed SGs
        Test->>AWS: Verify BYO SG attached, managed SGs deleted
        Test->>K8s: Remove BYO SG annotation
        Ctrl->>AWS: Create new managed SGs, attach to NLB
        Test->>AWS: Verify managed SGs attached, BYO SG detached but preserved
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes


Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (2 warnings, 2 inconclusive)

  • Test Structure And Quality (⚠️ Warning): Test 1 lacks a DeferCleanup for the Service/LoadBalancer, causing a resource leak; Test 2 has incorrect LIFO ordering and a race condition in the polling logic with the managed SG assertion. Resolution: register a DeferCleanup for the Test 1 Service immediately after creation; reorder Test 2 resource cleanup so the BYO SG cleanup runs last; fix the polling logic to narrow newManagedSGIDs or tighten the poll condition.
  • Ipv6 And Disconnected Network Test Compatibility (⚠️ Warning): the new e2e tests hardcode an IPv4 CIDR assumption in the authorizeSecurityGroupIngress helper, using only 0.0.0.0/0 with no IPv6 equivalent (::/0). Resolution: update authorizeSecurityGroupIngress to support both IPv4 and IPv6 by adding Ipv6Ranges with ::/0 alongside IpRanges, or make the CIDR selection dynamic based on the cluster IP family configuration.
  • Microshift Test Compatibility (❓ Inconclusive): MicroShift test compatibility cannot be assessed without access to the specified file. Resolution: provide the contents of cmd/cloud-controller-manager-aws-tests-ext/e2e/loadbalancer.go or specify the repository for retrieval.
  • Ote Binary Stdout Contract (❓ Inconclusive): unable to access the cloud-controller-manager-aws-tests-ext test directory or files; external code repository access is not available. Resolution: inspect the test files manually or use a local repository clone to examine the main(), TestMain(), BeforeSuite() and init() functions for stdout writes.
✅ Passed checks (8 passed)
  • Description Check (✅ Passed): check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): the PR title clearly references the main objective, adding e2e tests for the BYO (Bring Your Own) Security Group feature on AWS NLBs, which aligns with the changeset adding two new tests and supporting helper functions.
  • Docstring Coverage (✅ Passed): docstring coverage is 100.00%, above the required 80.00% threshold.
  • Linked Issues Check (✅ Passed): check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes Check (✅ Passed): check skipped because no linked issues were found for this pull request.
  • Stable And Deterministic Test Names (✅ Passed): both new test names are static strings with no dynamic content, timestamps, IDs, or variable interpolation that could change between runs.
  • Single Node Openshift (Sno) Test Compatibility (✅ Passed): the new e2e tests make no multi-node or HA cluster assumptions; both operate on LoadBalancer services with AWS NLBs and test the security group lifecycle, which works identically on SNO.
  • Topology-Aware Scheduling Compatibility (✅ Passed): the e2e test code contains no topology-specific scheduling constraints that would break on supported OpenShift topologies.
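The IPv6 warning above amounts to choosing ingress CIDRs per IP family instead of hardcoding 0.0.0.0/0. A minimal sketch of that selection, under the assumption of a boolean per-family input; in the real helper the returned values would populate the IpRanges and Ipv6Ranges fields of an EC2 IpPermission, which this stdlib-only sketch does not touch:

```go
package main

import "fmt"

// ingressCIDRs returns the open-world CIDRs to authorize for the
// given IP families. Hypothetical helper: the actual fix would feed
// these into IpRanges / Ipv6Ranges on the EC2 IpPermission struct.
func ingressCIDRs(ipv4, ipv6 bool) (v4 []string, v6 []string) {
    if ipv4 {
        v4 = append(v4, "0.0.0.0/0")
    }
    if ipv6 {
        v6 = append(v6, "::/0")
    }
    return v4, v6
}

func main() {
    v4, v6 := ingressCIDRs(true, true)
    fmt.Println(v4, v6) // → [0.0.0.0/0] [::/0]
}
```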

@openshift-ci
Contributor

openshift-ci Bot commented Apr 22, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign theobarberbany for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mfbonfigli
Author

/testwith openshift/cloud-provider-aws/main/e2e-aws-ovn-techpreview openshift/cloud-provider-aws#147 openshift/installer#10512


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (5)
cmd/cloud-controller-manager-aws-tests-ext/e2e/helper.go (3)

408-466: Consolidate duplicated DescribeInstances call across getClusterVPCID and getClusterName.

Both helpers independently call getClusterInstanceID (which lists all cluster nodes) and then DescribeInstances on the same instance. In the BYO SG test each round-trip triggers 4 API calls just for discovery. Consider a single getClusterMetadata that returns (instanceID, vpcID, clusterName) in one shot, or cache the instance lookup.

♻️ Sketch
func getClusterMetadata(ctx context.Context, cs clientset.Interface, ec2Client *ec2.Client) (vpcID, clusterName string, err error) {
    instanceID, err := getClusterInstanceID(ctx, cs)
    if err != nil {
        return "", "", err
    }
    result, err := ec2Client.DescribeInstances(ctx, &ec2.DescribeInstancesInput{
        InstanceIds: []string{instanceID},
    })
    if err != nil {
        return "", "", fmt.Errorf("failed to describe instance %s: %v", instanceID, err)
    }
    if len(result.Reservations) == 0 || len(result.Reservations[0].Instances) == 0 {
        return "", "", fmt.Errorf("instance %s not found", instanceID)
    }
    inst := result.Reservations[0].Instances[0]
    vpcID = aws.ToString(inst.VpcId)
    for _, tag := range inst.Tags {
        if name, ok := strings.CutPrefix(aws.ToString(tag.Key), "kubernetes.io/cluster/"); ok {
            clusterName = name
            break
        }
    }
    if vpcID == "" || clusterName == "" {
        return "", "", fmt.Errorf("missing VPC ID or cluster tag on instance %s", instanceID)
    }
    return vpcID, clusterName, nil
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/cloud-controller-manager-aws-tests-ext/e2e/helper.go` around lines 408 -
466, Both getClusterVPCID and getClusterName duplicate calls to
getClusterInstanceID and ec2Client.DescribeInstances causing unnecessary API
round-trips; consolidate by implementing a single helper (e.g.,
getClusterMetadata) that calls getClusterInstanceID once, calls
DescribeInstances once, extracts and returns instanceID, VPC ID and cluster name
(or returns both vpcID and clusterName) and update getClusterVPCID and
getClusterName to call this new helper (or add simple caching of the
DescribeInstances result keyed by instanceID) so the code uses one
DescribeInstances call per discovery.

344-369: Dead branch: ec2IsNotFoundError check after deleteAWSSecurityGroup.

deleteAWSSecurityGroup already swallows not-found errors and returns nil (lines 311-314), so the err != nil && ec2IsNotFoundError(err) branch at lines 362-365 is unreachable. Harmless, but can be removed to reduce confusion.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/cloud-controller-manager-aws-tests-ext/e2e/helper.go` around lines 344 -
369, The ec2IsNotFoundError check after calling deleteAWSSecurityGroup is
unreachable because deleteAWSSecurityGroup already swallows not-found errors and
returns nil; remove the entire ec2IsNotFoundError branch (the strings "security
group %s deleted" log and its return) from the error-handling block in the
polling closure, leaving the isAWSRetryableError and dependency-violation
branches and the final return of other errors intact, and rely on the normal
(err == nil) path to handle successful deletion logging elsewhere in the code.

436-466: getClusterName returns the first matching prefix tag without validating the value.

The function matches any tag whose key starts with kubernetes.io/cluster/, regardless of whether the value is owned or shared. In typical OpenShift clusters this is fine, but if an instance happens to carry a stale tag for another cluster (e.g., in shared-VPC setups), the discovery will pick an arbitrary one. Consider filtering on value == "owned" to be consistent with isSecurityGroupManaged at line 253.
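The suggested filtering can be sketched in isolation. This is a stdlib-only illustration of the tag-matching loop the review proposes; the tag struct is a stand-in for the EC2 types.Tag values the real getClusterName iterates via aws.ToString:

```go
package main

import (
    "fmt"
    "strings"
)

// tag mirrors the Key/Value shape of an EC2 instance tag
// (hypothetical local type for this sketch).
type tag struct{ Key, Value string }

const clusterTagPrefix = "kubernetes.io/cluster/"

// ownedClusterName returns the cluster name from the first
// kubernetes.io/cluster/<name> tag whose value is "owned",
// skipping "shared" or stale tags, as the review suggests
// to stay consistent with isSecurityGroupManaged.
func ownedClusterName(tags []tag) (string, bool) {
    for _, t := range tags {
        if name, ok := strings.CutPrefix(t.Key, clusterTagPrefix); ok && t.Value == "owned" {
            return name, true
        }
    }
    return "", false
}

func main() {
    tags := []tag{
        {Key: clusterTagPrefix + "other-cluster", Value: "shared"},
        {Key: clusterTagPrefix + "my-cluster", Value: "owned"},
    }
    name, _ := ownedClusterName(tags)
    fmt.Println(name) // → my-cluster
}
```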

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/cloud-controller-manager-aws-tests-ext/e2e/helper.go` around lines 436 -
466, getClusterName currently returns the first tag whose key has the
"kubernetes.io/cluster/" prefix without validating the tag value; change the
loop that iterates result.Reservations[0].Instances[0].Tags in getClusterName to
also check aws.ToString(tag.Value) == "owned" (same semantics used by
isSecurityGroupManaged) before accepting the cluster name parsed via
strings.CutPrefix, and continue searching if the value is not "owned"; return
the same error if no owned cluster tag is found.
cmd/cloud-controller-manager-aws-tests-ext/e2e/loadbalancer.go (2)

541-546: Security-group name may exceed AWS length limit / include invalid chars depending on namespace.

e2e-nlb-byo-sg-{create,update}-<ns.Name> is fine in typical e2e runs (namespace pattern cloud-provider-aws-XXXXX), but AWS SG names are limited to 255 chars and must match [a-zA-Z0-9 ._\-:/()#,@\[\]+=&;{}!$*]. This is unlikely to bite today, but if someone ever passes a longer suffix this will fail with a non-obvious InvalidGroup.Malformed. Worth a short comment or a stringsutil.ShortenName call for defense.

Also applies to: 726-730
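The defensive sanitization this nitpick asks for could look roughly like the sketch below. The character set is copied from the comment above and should be verified against current AWS documentation; sanitizeSGName is a hypothetical helper name, not something that exists in the PR:

```go
package main

import (
    "fmt"
    "regexp"
)

// invalidSGNameChars matches characters outside the set AWS allows
// in security group names (set taken from the review comment;
// verify against AWS docs before relying on it).
var invalidSGNameChars = regexp.MustCompile(`[^a-zA-Z0-9 ._\-:/()#,@\[\]+=&;{}!$*]`)

// sanitizeSGName replaces disallowed characters and truncates to
// the 255-character limit so createAWSSecurityGroup never fails
// with InvalidGroup.Malformed (hypothetical helper).
func sanitizeSGName(name string) string {
    s := invalidSGNameChars.ReplaceAllString(name, "-")
    if len(s) > 255 {
        s = s[:255]
    }
    return s
}

func main() {
    fmt.Println(sanitizeSGName("e2e-nlb-byo-sg-create-cloud-provider-aws-12345"))
}
```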

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/cloud-controller-manager-aws-tests-ext/e2e/loadbalancer.go` around lines
541 - 546, The SG name built with fmt.Sprintf("e2e-nlb-byo-sg-create-%s",
ns.Name) can exceed AWS length limits or contain invalid characters; update the
code that constructs sgName (used before calling createAWSSecurityGroup) to
first sanitize and shorten ns.Name (e.g. remove/replace invalid characters to
the allowed set and use a helper like stringsutil.ShortenName or implement
truncation) so the final sgName length stays <=255 and only contains valid
characters, then call createAWSSecurityGroup with the sanitized/shortened name;
apply the same fix to the analogous sgName construction at the other location
noted (create/update).

758-784: Redundant pollCtx.Done() select inside PollUntilContextTimeout.

wait.PollUntilContextTimeout already honors context cancellation between polls and passes a canceled pollCtx to the condition function; the explicit select { case <-pollCtx.Done(): return false, pollCtx.Err(); default: } blocks add no safety and just clutter the polling closures. Consider dropping them to match the style used elsewhere in this file (e.g., lines 409-425).

Also applies to: 832-879

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/cloud-controller-manager-aws-tests-ext/e2e/loadbalancer.go` around lines
758 - 784, Remove the redundant manual context-checking select inside the
PollUntilContextTimeout closure: in the wait.PollUntilContextTimeout(...)
callback that calls getAWSLoadBalancerFromDNSName, delete the select { case
<-pollCtx.Done(): return false, pollCtx.Err(); default: } since
PollUntilContextTimeout already supplies a cancelled pollCtx and handles
cancellation between polls; keep the rest of the logic that checks lb,
slices.Contains(lb.SecurityGroups, byoSGID) and sets updatedLB, and apply the
same removal to the analogous closure around lines 832-879.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 13fa2dcf-3d2d-41be-8656-fde03685dbe9

📥 Commits

Reviewing files that changed from the base of the PR and between 9240e39 and a2b197d.

📒 Files selected for processing (2)
  • cmd/cloud-controller-manager-aws-tests-ext/e2e/helper.go
  • cmd/cloud-controller-manager-aws-tests-ext/e2e/loadbalancer.go

Comment on lines +548 to +554
DeferCleanup(func(cleanupCtx context.Context) {
    By("cleaning up BYO security group")
    err := waitForSecurityGroupDeletion(cleanupCtx, ec2Client, byoSGID, 5*time.Minute)
    if err != nil {
        framework.Logf("warning: failed to delete BYO security group %s: %v", byoSGID, err)
    }
})

⚠️ Potential issue | 🟠 Major

Missing DeferCleanup for the Service/LB in the BYO-creation test; failure before line 631 leaks the LB.

Test 1 registers a DeferCleanup for the BYO SG (line 548), but deletes the Service inline at line 631 without a DeferCleanup safety net. If any Expect/framework.ExpectNoError between lines 590 and 630 fails, the Service+LB remain and the BYO SG cleanup runs while the LB still holds the SG — waitForSecurityGroupDeletion will retry DependencyViolation until its 5-minute timeout, log a warning, and leave the SG orphaned.

Register a service cleanup via DeferCleanup (see ordering note on Test 2), and keep the mid-test "delete service and verify SG preserved" as an explicit action — the DeferCleanup will be a no-op if deletion already happened.

🐛 Pattern
DeferCleanup(func(cleanupCtx context.Context) {
    // tolerate already-deleted service
    err := deleteServiceAndWaitForLoadBalancerDeletion(cleanupCtx, jig, lbDNS)
    if err != nil && !apierrors.IsNotFound(err) {
        framework.Logf("warning: service/LB cleanup failed: %v", err)
    }
})

Register it after service creation but before the BYO-SG cleanup (see companion comment) so teardown order is service → SG.

Also applies to: 630-633

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/cloud-controller-manager-aws-tests-ext/e2e/loadbalancer.go` around lines
548 - 554, Add a DeferCleanup that safely deletes the Service/LoadBalancer after
the service is created and before the BYO SG DeferCleanup so teardown order is
service → SG; implement it using DeferCleanup(func(cleanupCtx context.Context) {
err := deleteServiceAndWaitForLoadBalancerDeletion(cleanupCtx, jig, lbDNS); if
err != nil && !apierrors.IsNotFound(err) { framework.Logf("warning: service/LB
cleanup failed: %v", err) } }) so the mid-test explicit delete can remain but
the DeferCleanup is a no-op if the service was already removed; ensure this new
DeferCleanup is registered prior to the existing waitForSecurityGroupDeletion
DeferCleanup.

Comment on lines +702 to +739
DeferCleanup(func(cleanupCtx context.Context) {
    err := deleteServiceAndWaitForLoadBalancerDeletion(cleanupCtx, jig, lbDNS)
    framework.ExpectNoError(err, "failed to delete service and wait for load balancer deletion")
})

By("verifying managed security groups are attached initially")
Expect(len(foundLB.SecurityGroups)).To(BeNumerically(">", 0),
    "load balancer should have managed security groups attached initially")
initialManagedSGs := foundLB.SecurityGroups
framework.Logf("initial managed security groups: %v", initialManagedSGs)

By("verifying initial security groups are managed by the controller")
for _, sgID := range initialManagedSGs {
    isManaged, err := isSecurityGroupManaged(ctx, ec2Client, sgID, clusterName)
    framework.ExpectNoError(err, "failed to check if security group %s is managed", sgID)
    Expect(isManaged).To(BeTrue(),
        "initial security group %s should be managed by the controller", sgID)
}

By("discovering cluster VPC for BYO security group creation")
vpcID, err := getClusterVPCID(ctx, cs, ec2Client)
framework.ExpectNoError(err, "failed to get cluster VPC ID")

By("creating BYO security group for transition")
sgName := fmt.Sprintf("e2e-nlb-byo-sg-update-%s", ns.Name)
sgDescription := fmt.Sprintf("BYO Security Group for e2e test %s", ns.Name)
byoSGID, err := createAWSSecurityGroup(ctx, ec2Client, sgName, sgDescription, vpcID)
framework.ExpectNoError(err, "failed to create BYO security group")
framework.Logf("created BYO security group: %s", byoSGID)

// Add cleanup for BYO security group
DeferCleanup(func(cleanupCtx context.Context) {
    By("cleaning up BYO security group")
    err := waitForSecurityGroupDeletion(cleanupCtx, ec2Client, byoSGID, 5*time.Minute)
    if err != nil {
        framework.Logf("warning: failed to delete BYO security group %s: %v", byoSGID, err)
    }
})

⚠️ Potential issue | 🟠 Major

DeferCleanup LIFO ordering leaks the BYO security group on teardown.

DeferCleanups run in LIFO order. Service cleanup is registered first (line 702-705) and BYO SG cleanup is registered second (line 733-739), so at teardown the SG deletion runs before the service/LB deletion. While the LB still has the BYO SG attached, waitForSecurityGroupDeletion will spin on DependencyViolation until the 5-minute poll timeout, log a warning, and leak the SG.

In the happy path the test removes the BYO annotation on line 823-825, so the LB detaches the SG before cleanup — but any failure between lines 744 and 817 will take the leaking path.

Move the BYO SG creation (and its DeferCleanup) to before the service is created, so the service's cleanup is registered last and therefore runs first.

🐛 Suggested reordering
// 1) Create BYO SG & register its cleanup FIRST (runs LAST at teardown)
By("creating BYO security group for transition")
byoSGID, err := createAWSSecurityGroup(ctx, ec2Client, sgName, sgDescription, vpcID)
framework.ExpectNoError(err, "failed to create BYO security group")
DeferCleanup(func(cleanupCtx context.Context) {
    err := waitForSecurityGroupDeletion(cleanupCtx, ec2Client, byoSGID, 5*time.Minute)
    if err != nil {
        framework.Logf("warning: failed to delete BYO security group %s: %v", byoSGID, err)
    }
})

// 2) Then create the service & register its cleanup (runs FIRST at teardown)
_, jig, err := createServiceNLB(ctx, cs, ns, serviceName, map[int32]int32{80: 8080})
framework.ExpectNoError(err, "failed to create NLB service load balancer")
DeferCleanup(func(cleanupCtx context.Context) {
    err := deleteServiceAndWaitForLoadBalancerDeletion(cleanupCtx, jig, lbDNS)
    framework.ExpectNoError(err, "failed to delete service and wait for load balancer deletion")
})

This requires hoisting vpcID discovery above the service creation as well.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/cloud-controller-manager-aws-tests-ext/e2e/loadbalancer.go` around lines
702 - 739, The test registers the BYO SG cleanup after the service cleanup so
LIFO ordering causes the SG deletion to run before the LB is removed, leaking
the SG; fix by creating the BYO SG (call createAWSSecurityGroup, capturing
byoSGID) and registering its DeferCleanup (which calls
waitForSecurityGroupDeletion) before creating the service (createServiceNLB) and
registering the service DeferCleanup that calls
deleteServiceAndWaitForLoadBalancerDeletion with lbDNS; to do this hoist
getClusterVPCID above the service creation, move the BYO SG creation + its
DeferCleanup ahead of where createServiceNLB is called so the service cleanup
runs first at teardown.

Comment on lines +862 to +895
    // Check if the new SG is managed
    for _, sgID := range lb.SecurityGroups {
        managed, err := isSecurityGroupManaged(pollCtx, ec2Client, sgID, clusterName)
        if err != nil {
            framework.Logf("error checking if SG %s is managed: %v", sgID, err)
            return false, nil
        }
        if managed {
            finalLB = lb
            newManagedSGIDs = lb.SecurityGroups
            framework.Logf("new managed security groups attached: %v", newManagedSGIDs)
            return true, nil
        }
    }

    framework.Logf("waiting for managed security groups to be created and attached")
    return false, nil
})
framework.ExpectNoError(err, "new managed security group should be created and attached after removing BYO annotation")
Expect(finalLB).NotTo(BeNil(), "final load balancer should not be nil")
Expect(len(newManagedSGIDs)).To(BeNumerically(">", 0), "should have new managed security groups attached")

By("verifying BYO security group is no longer attached after transition to managed")
hasBYO := slices.Contains(finalLB.SecurityGroups, byoSGID)
Expect(hasBYO).To(BeFalse(),
    "BYO security group %s should no longer be attached to load balancer after transition", byoSGID)

By("verifying new managed security groups are controller-owned")
for _, sgID := range newManagedSGIDs {
    isManaged, err := isSecurityGroupManaged(ctx, ec2Client, sgID, clusterName)
    framework.ExpectNoError(err, "failed to check if security group %s is managed", sgID)
    Expect(isManaged).To(BeTrue(),
        "security group %s should be managed by the controller (have 'owned' tag)", sgID)
}

⚠️ Potential issue | 🟡 Minor

Potential flake: newManagedSGIDs captures all LB SGs, but the later assertion requires every one to be managed.

The poll exits as soon as any SG on the LB returns isManaged == true (line 869-874), and then assigns newManagedSGIDs = lb.SecurityGroups. The subsequent assertion at line 890-895 requires every SG in that slice to be controller-owned. If the LB briefly carries a second SG that hasn't yet been tagged (or is not tagged at all), the poll would succeed but the outer loop would fail.

Either narrow newManagedSGIDs to only the SGs that passed the managed check, or tighten the poll to require all attached SGs to be managed before returning success.

♻️ Suggested tightening
-            // Check if the new SG is managed
-            for _, sgID := range lb.SecurityGroups {
-                managed, err := isSecurityGroupManaged(pollCtx, ec2Client, sgID, clusterName)
-                if err != nil {
-                    framework.Logf("error checking if SG %s is managed: %v", sgID, err)
-                    return false, nil
-                }
-                if managed {
-                    finalLB = lb
-                    newManagedSGIDs = lb.SecurityGroups
-                    framework.Logf("new managed security groups attached: %v", newManagedSGIDs)
-                    return true, nil
-                }
-            }
+            // Require all attached SGs to be managed before accepting.
+            allManaged := true
+            for _, sgID := range lb.SecurityGroups {
+                managed, err := isSecurityGroupManaged(pollCtx, ec2Client, sgID, clusterName)
+                if err != nil {
+                    framework.Logf("error checking if SG %s is managed: %v", sgID, err)
+                    return false, nil
+                }
+                if !managed {
+                    allManaged = false
+                    break
+                }
+            }
+            if allManaged {
+                finalLB = lb
+                newManagedSGIDs = lb.SecurityGroups
+                return true, nil
+            }
+            return false, nil
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/cloud-controller-manager-aws-tests-ext/e2e/loadbalancer.go` around lines
862 - 895, The poll currently returns success as soon as any security group on
the load balancer passes isSecurityGroupManaged, then assigns newManagedSGIDs =
lb.SecurityGroups which may include untagged SGs; change the logic in the
polling closure used around isSecurityGroupManaged so it either (A) collects
only the SG IDs that returned managed==true into newManagedSGIDs (i.e., append
sgID only when isSecurityGroupManaged returns true and set finalLB from that
filtered set) or (B) require all sgIDs in lb.SecurityGroups to return
managed==true before returning success (so the loop only returns true when every
call to isSecurityGroupManaged succeeded); update references to newManagedSGIDs,
finalLB and the poll termination condition accordingly so the later verification
loop only iterates over SGs that were confirmed managed.
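The prompt's option (A), narrowing newManagedSGIDs to only the groups that passed the managed check, can be sketched with a stubbed check function; filterManaged and the SG IDs here are illustrative stand-ins, not code from the PR:

```go
package main

import "fmt"

// filterManaged keeps only the SG IDs the managed-check confirms, so
// later assertions never see untagged groups. isManaged is a stand-in
// for the test's isSecurityGroupManaged helper.
func filterManaged(sgIDs []string, isManaged func(string) (bool, error)) ([]string, error) {
	var managed []string
	for _, id := range sgIDs {
		ok, err := isManaged(id)
		if err != nil {
			return nil, err
		}
		if ok {
			managed = append(managed, id)
		}
	}
	return managed, nil
}

func main() {
	// Fake tag store: only sg-aaa carries the cluster "owned" tag.
	owned := map[string]bool{"sg-aaa": true, "sg-bbb": false}
	got, err := filterManaged([]string{"sg-aaa", "sg-bbb"}, func(id string) (bool, error) {
		return owned[id], nil
	})
	fmt.Println(got, err) // only the confirmed-managed IDs survive
}
```

The later "controller-owned" assertion loop would then iterate over this filtered slice, making the success criterion and the assertion consistent.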

@mtulio
Contributor

mtulio commented Apr 23, 2026

@mfbonfigli fyi the OTE test structure changed slightly recently; PTAL at #446 to port your changes

@openshift-ci openshift-ci Bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 23, 2026
@openshift-ci
Contributor

openshift-ci Bot commented Apr 23, 2026

PR needs rebase.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.


Labels

- do-not-merge/work-in-progress: Indicates that a PR should not merge because it is a work in progress.
- jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
- needs-rebase: Indicates a PR cannot be merged because it has merge conflicts with HEAD.

3 participants