DNM/SPLAT-2455: Add e2e tests for BYO SG feature on AWS NLBs #452

mfbonfigli wants to merge 2 commits into openshift:main
Conversation
Adds end-to-end tests for BYO security groups on AWS NLBs. These tests are defined downstream due to limitations in the upstream test setup. They validate the basic flows of provisioning an NLB with a BYO security group and switching back and forth between managed and BYO security groups.
Uses AWS SDK error information to detect API operations that fail with retryable errors, such as throttling or transient network issues.
@mfbonfigli: This pull request references SPLAT-2455, which is a valid Jira issue. Warning: the referenced Jira issue has an invalid target version for the branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Skipping CI for Draft Pull Request.
Walkthrough

These changes introduce e2e test infrastructure for AWS Network Load Balancer security group management, including helpers for EC2 security group lifecycle operations, cluster discovery utilities, and comprehensive test cases for bring-your-own (BYO) security group workflows and managed-to-BYO transitions.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Test
    participant K8s as Kubernetes Service
    participant Ctrl as Controller
    participant AWS as AWS (NLB/SG)

    rect rgba(200, 150, 100, 0.5)
    Note over Test,AWS: Test 1: BYO SG Creation & Preservation
    Test->>AWS: Create security group in cluster VPC
    Test->>AWS: Authorize ingress rules
    Test->>K8s: Create LoadBalancer Service with BYO SG annotation
    Ctrl->>AWS: Attach BYO SG to NLB
    Test->>AWS: Verify BYO SG attached, not controller-managed
    Test->>K8s: Delete Service
    Ctrl->>AWS: Delete NLB
    Test->>AWS: Verify BYO SG still exists, unmanaged
    end

    rect rgba(100, 150, 200, 0.5)
    Note over Test,AWS: Test 2: Managed ↔ BYO ↔ Managed Round-trip
    Test->>K8s: Create LoadBalancer Service (no BYO annotation)
    Ctrl->>AWS: Create managed SGs, attach to NLB
    Test->>AWS: Verify managed SGs attached
    Test->>AWS: Create BYO SG with ingress rules
    Test->>K8s: Update Service with BYO SG annotation
    Ctrl->>AWS: Attach BYO SG to NLB
    Ctrl->>AWS: Delete previous managed SGs
    Test->>AWS: Verify BYO SG attached, managed SGs deleted
    Test->>K8s: Remove BYO SG annotation
    Ctrl->>AWS: Create new managed SGs, attach to NLB
    Test->>AWS: Verify managed SGs attached, BYO SG detached but preserved
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

Pre-merge checks failed. Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (2 warnings, 2 inconclusive)
✅ Passed checks (8 passed)
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
/testwith openshift/cloud-provider-aws/main/e2e-aws-ovn-techpreview openshift/cloud-provider-aws#147 openshift/installer#10512
Actionable comments posted: 3
🧹 Nitpick comments (5)
cmd/cloud-controller-manager-aws-tests-ext/e2e/helper.go (3)
408-466: Consolidate the duplicated `DescribeInstances` call across `getClusterVPCID` and `getClusterName`.

Both helpers independently call `getClusterInstanceID` (which lists all cluster nodes) and then `DescribeInstances` on the same instance. In the BYO SG test each round-trip triggers 4 API calls just for discovery. Consider a single `getClusterMetadata` that returns `(instanceID, vpcID, clusterName)` in one shot, or cache the instance lookup.

♻️ Sketch

```go
func getClusterMetadata(ctx context.Context, cs clientset.Interface, ec2Client *ec2.Client) (vpcID, clusterName string, err error) {
	instanceID, err := getClusterInstanceID(ctx, cs)
	if err != nil {
		return "", "", err
	}
	result, err := ec2Client.DescribeInstances(ctx, &ec2.DescribeInstancesInput{
		InstanceIds: []string{instanceID},
	})
	if err != nil {
		return "", "", fmt.Errorf("failed to describe instance %s: %v", instanceID, err)
	}
	if len(result.Reservations) == 0 || len(result.Reservations[0].Instances) == 0 {
		return "", "", fmt.Errorf("instance %s not found", instanceID)
	}
	inst := result.Reservations[0].Instances[0]
	vpcID = aws.ToString(inst.VpcId)
	for _, tag := range inst.Tags {
		if name, ok := strings.CutPrefix(aws.ToString(tag.Key), "kubernetes.io/cluster/"); ok {
			clusterName = name
			break
		}
	}
	if vpcID == "" || clusterName == "" {
		return "", "", fmt.Errorf("missing VPC ID or cluster tag on instance %s", instanceID)
	}
	return vpcID, clusterName, nil
}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/cloud-controller-manager-aws-tests-ext/e2e/helper.go` around lines 408 - 466, Both getClusterVPCID and getClusterName duplicate calls to getClusterInstanceID and ec2Client.DescribeInstances causing unnecessary API round-trips; consolidate by implementing a single helper (e.g., getClusterMetadata) that calls getClusterInstanceID once, calls DescribeInstances once, extracts and returns instanceID, VPC ID and cluster name (or returns both vpcID and clusterName) and update getClusterVPCID and getClusterName to call this new helper (or add simple caching of the DescribeInstances result keyed by instanceID) so the code uses one DescribeInstances call per discovery.
344-369: Dead branch: `ec2IsNotFoundError` check after `deleteAWSSecurityGroup`.

`deleteAWSSecurityGroup` already swallows not-found errors and returns `nil` (lines 311-314), so the `err != nil && ec2IsNotFoundError(err)` branch at lines 362-365 is unreachable. Harmless, but can be removed to reduce confusion.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/cloud-controller-manager-aws-tests-ext/e2e/helper.go` around lines 344 - 369, The ec2IsNotFoundError check after calling deleteAWSSecurityGroup is unreachable because deleteAWSSecurityGroup already swallows not-found errors and returns nil; remove the entire ec2IsNotFoundError branch (the strings "security group %s deleted" log and its return) from the error-handling block in the polling closure, leaving the isAWSRetryableError and dependency-violation branches and the final return of other errors intact, and rely on the normal (err == nil) path to handle successful deletion logging elsewhere in the code.
436-466: `getClusterName` returns the first matching prefix tag without validating the value.

The function matches any tag whose key starts with `kubernetes.io/cluster/`, regardless of whether the value is `owned` or `shared`. In typical OpenShift clusters this is fine, but if an instance happens to carry a stale tag for another cluster (e.g., in shared-VPC setups), the discovery will pick an arbitrary one. Consider filtering on value `== "owned"` to be consistent with `isSecurityGroupManaged` at line 253.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@cmd/cloud-controller-manager-aws-tests-ext/e2e/helper.go` around lines 436 - 466, getClusterName currently returns the first tag whose key has the "kubernetes.io/cluster/" prefix without validating the tag value; change the loop that iterates result.Reservations[0].Instances[0].Tags in getClusterName to also check aws.ToString(tag.Value) == "owned" (same semantics used by isSecurityGroupManaged) before accepting the cluster name parsed via strings.CutPrefix, and continue searching if the value is not "owned"; return the same error if no owned cluster tag is found.

cmd/cloud-controller-manager-aws-tests-ext/e2e/loadbalancer.go (2)
541-546: Security-group name may exceed the AWS length limit or include invalid characters, depending on the namespace.

`e2e-nlb-byo-sg-{create,update}-<ns.Name>` is fine in typical e2e runs (namespace pattern `cloud-provider-aws-XXXXX`), but AWS SG names are limited to 255 chars and must match `[a-zA-Z0-9 ._\-:/()#,@\[\]+=&;{}!$*]`. This is unlikely to bite today, but if someone ever passes a longer suffix this will fail with a non-obvious `InvalidGroup.Malformed`. Worth a short comment or a `stringsutil.ShortenName` call for defense.

Also applies to: 726-730
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/cloud-controller-manager-aws-tests-ext/e2e/loadbalancer.go` around lines 541 - 546, The SG name built with fmt.Sprintf("e2e-nlb-byo-sg-create-%s", ns.Name) can exceed AWS length limits or contain invalid characters; update the code that constructs sgName (used before calling createAWSSecurityGroup) to first sanitize and shorten ns.Name (e.g. remove/replace invalid characters to the allowed set and use a helper like stringsutil.ShortenName or implement truncation) so the final sgName length stays <=255 and only contains valid characters, then call createAWSSecurityGroup with the sanitized/shortened name; apply the same fix to the analogous sgName construction at the other location noted (create/update).
758-784: Redundant `pollCtx.Done()` select inside `PollUntilContextTimeout`.

`wait.PollUntilContextTimeout` already honors context cancellation between polls and passes a canceled `pollCtx` to the condition function; the explicit `select { case <-pollCtx.Done(): return false, pollCtx.Err(); default: }` blocks add no safety and just clutter the polling closures. Consider dropping them to match the style used elsewhere in this file (e.g., lines 409-425).

Also applies to: 832-879
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/cloud-controller-manager-aws-tests-ext/e2e/loadbalancer.go` around lines 758 - 784, Remove the redundant manual context-checking select inside the PollUntilContextTimeout closure: in the wait.PollUntilContextTimeout(...) callback that calls getAWSLoadBalancerFromDNSName, delete the select { case <-pollCtx.Done(): return false, pollCtx.Err(); default: } since PollUntilContextTimeout already supplies a cancelled pollCtx and handles cancellation between polls; keep the rest of the logic that checks lb, slices.Contains(lb.SecurityGroups, byoSGID) and sets updatedLB, and apply the same removal to the analogous closure around lines 832-879.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 13fa2dcf-3d2d-41be-8656-fde03685dbe9
📒 Files selected for processing (2)
cmd/cloud-controller-manager-aws-tests-ext/e2e/helper.go
cmd/cloud-controller-manager-aws-tests-ext/e2e/loadbalancer.go
```go
DeferCleanup(func(cleanupCtx context.Context) {
	By("cleaning up BYO security group")
	err := waitForSecurityGroupDeletion(cleanupCtx, ec2Client, byoSGID, 5*time.Minute)
	if err != nil {
		framework.Logf("warning: failed to delete BYO security group %s: %v", byoSGID, err)
	}
})
```
Missing DeferCleanup for the Service/LB in the BYO-creation test; failure before line 631 leaks the LB.
Test 1 registers a DeferCleanup for the BYO SG (line 548), but deletes the Service inline at line 631 without a DeferCleanup safety net. If any Expect/framework.ExpectNoError between lines 590 and 630 fails, the Service+LB remain and the BYO SG cleanup runs while the LB still holds the SG — waitForSecurityGroupDeletion will retry DependencyViolation until its 5-minute timeout, log a warning, and leave the SG orphaned.
Register a service cleanup via DeferCleanup (see ordering note on Test 2), and keep the mid-test "delete service and verify SG preserved" as an explicit action — the DeferCleanup will be a no-op if deletion already happened.
🐛 Pattern

```go
DeferCleanup(func(cleanupCtx context.Context) {
	// tolerate already-deleted service
	err := deleteServiceAndWaitForLoadBalancerDeletion(cleanupCtx, jig, lbDNS)
	if err != nil && !apierrors.IsNotFound(err) {
		framework.Logf("warning: service/LB cleanup failed: %v", err)
	}
})
```

Register it after service creation but before the BYO-SG cleanup (see companion comment) so teardown order is service → SG.
Also applies to: 630-633
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cmd/cloud-controller-manager-aws-tests-ext/e2e/loadbalancer.go` around lines
548 - 554, Add a DeferCleanup that safely deletes the Service/LoadBalancer after
the service is created and before the BYO SG DeferCleanup so teardown order is
service → SG; implement it using DeferCleanup(func(cleanupCtx context.Context) {
err := deleteServiceAndWaitForLoadBalancerDeletion(cleanupCtx, jig, lbDNS); if
err != nil && !apierrors.IsNotFound(err) { framework.Logf("warning: service/LB
cleanup failed: %v", err) } }) so the mid-test explicit delete can remain but
the DeferCleanup is a no-op if the service was already removed; ensure this new
DeferCleanup is registered prior to the existing waitForSecurityGroupDeletion
DeferCleanup.
```go
DeferCleanup(func(cleanupCtx context.Context) {
	err := deleteServiceAndWaitForLoadBalancerDeletion(cleanupCtx, jig, lbDNS)
	framework.ExpectNoError(err, "failed to delete service and wait for load balancer deletion")
})

By("verifying managed security groups are attached initially")
Expect(len(foundLB.SecurityGroups)).To(BeNumerically(">", 0),
	"load balancer should have managed security groups attached initially")
initialManagedSGs := foundLB.SecurityGroups
framework.Logf("initial managed security groups: %v", initialManagedSGs)

By("verifying initial security groups are managed by the controller")
for _, sgID := range initialManagedSGs {
	isManaged, err := isSecurityGroupManaged(ctx, ec2Client, sgID, clusterName)
	framework.ExpectNoError(err, "failed to check if security group %s is managed", sgID)
	Expect(isManaged).To(BeTrue(),
		"initial security group %s should be managed by the controller", sgID)
}

By("discovering cluster VPC for BYO security group creation")
vpcID, err := getClusterVPCID(ctx, cs, ec2Client)
framework.ExpectNoError(err, "failed to get cluster VPC ID")

By("creating BYO security group for transition")
sgName := fmt.Sprintf("e2e-nlb-byo-sg-update-%s", ns.Name)
sgDescription := fmt.Sprintf("BYO Security Group for e2e test %s", ns.Name)
byoSGID, err := createAWSSecurityGroup(ctx, ec2Client, sgName, sgDescription, vpcID)
framework.ExpectNoError(err, "failed to create BYO security group")
framework.Logf("created BYO security group: %s", byoSGID)

// Add cleanup for BYO security group
DeferCleanup(func(cleanupCtx context.Context) {
	By("cleaning up BYO security group")
	err := waitForSecurityGroupDeletion(cleanupCtx, ec2Client, byoSGID, 5*time.Minute)
	if err != nil {
		framework.Logf("warning: failed to delete BYO security group %s: %v", byoSGID, err)
	}
})
```
DeferCleanup LIFO ordering leaks the BYO security group on teardown.
DeferCleanups run in LIFO order. Service cleanup is registered first (line 702-705) and BYO SG cleanup is registered second (line 733-739), so at teardown the SG deletion runs before the service/LB deletion. While the LB still has the BYO SG attached, waitForSecurityGroupDeletion will spin on DependencyViolation until the 5-minute poll timeout, log a warning, and leak the SG.
In the happy path the test removes the BYO annotation on line 823-825, so the LB detaches the SG before cleanup — but any failure between lines 744 and 817 will take the leaking path.
Move the BYO SG creation (and its DeferCleanup) to before the service is created, so the service's cleanup is registered last and therefore runs first.
🐛 Suggested reordering

```go
// 1) Create BYO SG & register its cleanup FIRST (runs LAST at teardown)
By("creating BYO security group for transition")
byoSGID, err := createAWSSecurityGroup(ctx, ec2Client, sgName, sgDescription, vpcID)
framework.ExpectNoError(err, "failed to create BYO security group")
DeferCleanup(func(cleanupCtx context.Context) {
	err := waitForSecurityGroupDeletion(cleanupCtx, ec2Client, byoSGID, 5*time.Minute)
	if err != nil {
		framework.Logf("warning: failed to delete BYO security group %s: %v", byoSGID, err)
	}
})

// 2) Then create the service & register its cleanup (runs FIRST at teardown)
_, jig, err := createServiceNLB(ctx, cs, ns, serviceName, map[int32]int32{80: 8080})
framework.ExpectNoError(err, "failed to create NLB service load balancer")
DeferCleanup(func(cleanupCtx context.Context) {
	err := deleteServiceAndWaitForLoadBalancerDeletion(cleanupCtx, jig, lbDNS)
	framework.ExpectNoError(err, "failed to delete service and wait for load balancer deletion")
})
```

This requires hoisting vpcID discovery above the service creation as well.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cmd/cloud-controller-manager-aws-tests-ext/e2e/loadbalancer.go` around lines
702 - 739, The test registers the BYO SG cleanup after the service cleanup so
LIFO ordering causes the SG deletion to run before the LB is removed, leaking
the SG; fix by creating the BYO SG (call createAWSSecurityGroup, capturing
byoSGID) and registering its DeferCleanup (which calls
waitForSecurityGroupDeletion) before creating the service (createServiceNLB) and
registering the service DeferCleanup that calls
deleteServiceAndWaitForLoadBalancerDeletion with lbDNS; to do this hoist
getClusterVPCID above the service creation, move the BYO SG creation + its
DeferCleanup ahead of where createServiceNLB is called so the service cleanup
runs first at teardown.
```go
		// Check if the new SG is managed
		for _, sgID := range lb.SecurityGroups {
			managed, err := isSecurityGroupManaged(pollCtx, ec2Client, sgID, clusterName)
			if err != nil {
				framework.Logf("error checking if SG %s is managed: %v", sgID, err)
				return false, nil
			}
			if managed {
				finalLB = lb
				newManagedSGIDs = lb.SecurityGroups
				framework.Logf("new managed security groups attached: %v", newManagedSGIDs)
				return true, nil
			}
		}

		framework.Logf("waiting for managed security groups to be created and attached")
		return false, nil
	})
	framework.ExpectNoError(err, "new managed security group should be created and attached after removing BYO annotation")
	Expect(finalLB).NotTo(BeNil(), "final load balancer should not be nil")
	Expect(len(newManagedSGIDs)).To(BeNumerically(">", 0), "should have new managed security groups attached")

	By("verifying BYO security group is no longer attached after transition to managed")
	hasBYO := slices.Contains(finalLB.SecurityGroups, byoSGID)
	Expect(hasBYO).To(BeFalse(),
		"BYO security group %s should no longer be attached to load balancer after transition", byoSGID)

	By("verifying new managed security groups are controller-owned")
	for _, sgID := range newManagedSGIDs {
		isManaged, err := isSecurityGroupManaged(ctx, ec2Client, sgID, clusterName)
		framework.ExpectNoError(err, "failed to check if security group %s is managed", sgID)
		Expect(isManaged).To(BeTrue(),
			"security group %s should be managed by the controller (have 'owned' tag)", sgID)
	}
```
Potential flake: newManagedSGIDs captures all LB SGs, but the later assertion requires every one to be managed.
The poll exits as soon as any SG on the LB returns isManaged == true (line 869-874), and then assigns newManagedSGIDs = lb.SecurityGroups. The subsequent assertion at line 890-895 requires every SG in that slice to be controller-owned. If the LB briefly carries a second SG that hasn't yet been tagged (or is not tagged at all), the poll would succeed but the outer loop would fail.
Either narrow newManagedSGIDs to only the SGs that passed the managed check, or tighten the poll to require all attached SGs to be managed before returning success.
♻️ Suggested tightening

```diff
-			// Check if the new SG is managed
-			for _, sgID := range lb.SecurityGroups {
-				managed, err := isSecurityGroupManaged(pollCtx, ec2Client, sgID, clusterName)
-				if err != nil {
-					framework.Logf("error checking if SG %s is managed: %v", sgID, err)
-					return false, nil
-				}
-				if managed {
-					finalLB = lb
-					newManagedSGIDs = lb.SecurityGroups
-					framework.Logf("new managed security groups attached: %v", newManagedSGIDs)
-					return true, nil
-				}
-			}
+			// Require all attached SGs to be managed before accepting.
+			allManaged := true
+			for _, sgID := range lb.SecurityGroups {
+				managed, err := isSecurityGroupManaged(pollCtx, ec2Client, sgID, clusterName)
+				if err != nil {
+					framework.Logf("error checking if SG %s is managed: %v", sgID, err)
+					return false, nil
+				}
+				if !managed {
+					allManaged = false
+					break
+				}
+			}
+			if allManaged {
+				finalLB = lb
+				newManagedSGIDs = lb.SecurityGroups
+				return true, nil
+			}
+			return false, nil
```

🤖 Prompt for AI Agents
+ return false, nil🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cmd/cloud-controller-manager-aws-tests-ext/e2e/loadbalancer.go` around lines
862 - 895, The poll currently returns success as soon as any security group on
the load balancer passes isSecurityGroupManaged, then assigns newManagedSGIDs =
lb.SecurityGroups which may include untagged SGs; change the logic in the
polling closure used around isSecurityGroupManaged so it either (A) collects
only the SG IDs that returned managed==true into newManagedSGIDs (i.e., append
sgID only when isSecurityGroupManaged returns true and set finalLB from that
filtered set) or (B) require all sgIDs in lb.SecurityGroups to return
managed==true before returning success (so the loop only returns true when every
call to isSecurityGroupManaged succeeded); update references to newManagedSGIDs,
finalLB and the poll termination condition accordingly so the later verification
loop only iterates over SGs that were confirmed managed.
@mfbonfigli fyi the OTE test structure changed slightly recently; PTAL at #446 to port your changes.
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Do Not Merge / WIP
This PR adds e2e tests for the BYO security group feature. It depends on the feature merging in upstream kubernetes/cloud-provider-aws and on adding the IAM Set Security Groups permission to the CCM. The PR is currently just a way to execute e2e tests using the "testwith" command.
Unlike PR #451, this one does not base its changes on PR #447, which is still a draft with known issues that prevent the correct execution of tests.