SPLAT-2715: Fixed broken test and simplified some logic #182
openshift-merge-bot[bot] merged 2 commits into openshift:main from
Conversation
|
@vr4manta: This pull request references SPLAT-2715 which is a valid jira issue.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Skipping CI for Draft Pull Request. |
|
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review.
Walkthrough: Replaced context-level and per-test deferred cleanup in the dedicated-host e2e tests with per-test cleanup calls.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
|
/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-ovn-dedicated-serial-techpreview-1of2 |
|
@vr4manta: trigger 2 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/83ed4120-31bd-11f1-8ce9-02ef516c0ad0-0 |
|
/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-ovn-dedicated-serial-techpreview-1of2 |
|
@vr4manta: trigger 2 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/d9c12930-31e6-11f1-8710-50ce8d8e61a0-0 |
|
/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-aws-ovn-dedicated-serial-techpreview-2of2 |
|
@vr4manta: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/d2767c60-3340-11f1-9415-ca7b3f77fd00-0 |
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openshift-tests/test/e2e/dedicated_hosts.go`:
- Around line 520-528: The current Eventually block masks all errors by
returning true whenever getDedicatedHostState(ctx, ec2Client, dynamicHostID)
returns an error; update the error handling so you only treat a "host not found"
/ "already deleted" error as a success and propagate or fail on any other
errors: call getDedicatedHostState, if err indicates a not-found (inspect error
type/message), log the not-found case and return true, but for other errors
log/return false (or surface the error via Expect/Fail) so transient
network/perm issues don't make the test pass; keep the state check for state ==
"released" || state == "pending" and reference getDedicatedHostState, ec2Client
and dynamicHostID when making the change.
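For readers following along without the diff, here is a minimal Go sketch of the shape this suggestion describes, reusing the names it cites (`getDedicatedHostState`, `ec2Client`, `dynamicHostID`, `GinkgoWriter`) plus the test file's `defaultTestTimeout`/`defaultPollingInterval`. It shows the retry-friendly option (log and return false), which later review rounds also converge on; the `strings.Contains` not-found check is illustrative rather than the PR's actual error handling:

```go
// Sketch only: treat a not-found host as successfully released, retry on any
// other error instead of masking it, and otherwise keep the state check.
Eventually(func() bool {
	state, err := getDedicatedHostState(ctx, ec2Client, dynamicHostID)
	if err != nil {
		if strings.Contains(err.Error(), "not found") {
			GinkgoWriter.Printf("dedicated host %s already deleted\n", dynamicHostID)
			return true
		}
		// Transient AWS/network errors: log and let Eventually retry.
		GinkgoWriter.Printf("failed to describe dedicated host %s: %v\n", dynamicHostID, err)
		return false
	}
	return state == "released" || state == "pending"
}, defaultTestTimeout, defaultPollingInterval).Should(BeTrue())
```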
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 004ce3d7-c30e-4de0-8e38-613e5be6f829
📒 Files selected for processing (1)
openshift-tests/test/e2e/dedicated_hosts.go
Actionable comments posted: 2
♻️ Duplicate comments (1)
openshift-tests/test/e2e/dedicated_hosts.go (1)
521-534: ⚠️ Potential issue | 🟠 Major: Keep the `Eventually` poll retryable for non-not-found AWS errors.
`Fail(...)` inside the polling callback aborts the check on the first transient `DescribeHosts` error. That makes this release verification flaky under throttling/network blips instead of retrying until timeout. Log and return `false` for retryable errors; only treat confirmed not-found as success.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openshift-tests/test/e2e/dedicated_hosts.go` around lines 521 - 534, The polling callback passed to Eventually aborts on any non-not-found error because it calls Fail(...) inside the closure; change the logic in the closure that calls getDedicatedHostState(ctx, ec2Client, dynamicHostID) so that only recognized "not found" errors (strings.Contains checks) return true, while all other errors are logged with GinkgoWriter.Printf (including error details and dynamicHostID) and the closure returns false to allow Eventually to retry until timeout; remove the Fail(...) call from this callback so transient DescribeHosts errors don't abort the poll.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openshift-tests/test/e2e/dedicated_hosts.go`:
- Around line 893-920: The code captures nodeName from m.Status.NodeRef only
once before deletion, so if NodeRef is unset the helper skips waiting for node
removal and a late-registering node can leak; update the cleanup in the
dedicated hosts test to, when nodeName is empty, poll the Machine resource
(using client.Machines in machineutil.MachineAPINamespace and the machineName)
after deletion succeeds to detect if/when Status.NodeRef becomes non-nil (or
stop polling if the Machine itself becomes NotFound), set nodeName from
m.Status.NodeRef.Name once observed, and then proceed to call
kubeClient.CoreV1().Nodes().Get in the existing Eventually block to wait for the
node to be removed; ensure you still handle apierrors.IsNotFound for both
machine and node cases.
- Around line 396-397: Register the teardown immediately after a successful
Create instead of only at the end of the spec: right after the machine creation
succeeds, call/attach cleanupMachineAndNode(ctx, kubeConfig, kubeClient,
machineName) via the test framework's cleanup registration (e.g.,
t.Cleanup/DeferCleanup/framework.RegisterCleanupAction) so it will run on test
failures; apply the same pattern for the dedicated-host creation(s) that
currently only get cleaned at the end so both machine/node and dedicated-host
cleanups run even on early Expect failures.
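A rough Ginkgo v2 sketch of what registering the teardown immediately after a successful create could look like, assuming the `cleanupMachineAndNode(ctx, kubeConfig, kubeClient, machineName)` signature quoted above and the same machine client used elsewhere in the file; `newMachine` is a placeholder for whatever spec the test actually builds:

```go
// Sketch only: register cleanup as soon as the Machine exists so teardown
// still runs if a later Expect fails partway through the spec.
machine, err := client.Machines(machineutil.MachineAPINamespace).Create(ctx, newMachine, metav1.CreateOptions{})
Expect(err).NotTo(HaveOccurred())
machineName := machine.Name

DeferCleanup(func() {
	cleanupMachineAndNode(ctx, kubeConfig, kubeClient, machineName)
})
```

The same pattern would apply to the dedicated-host allocation: allocate the host, assert success, then DeferCleanup its release before any further assertions run.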
---
Duplicate comments:
In `@openshift-tests/test/e2e/dedicated_hosts.go`:
- Around line 521-534: The polling callback passed to Eventually aborts on any
non-not-found error because it calls Fail(...) inside the closure; change the
logic in the closure that calls getDedicatedHostState(ctx, ec2Client,
dynamicHostID) so that only recognized "not found" errors (strings.Contains
checks) return true, while all other errors are logged with GinkgoWriter.Printf
(including error details and dynamicHostID) and the closure returns false to
allow Eventually to retry until timeout; remove the Fail(...) call from this
callback so transient DescribeHosts errors don't abort the poll.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: fbbfef5b-ac86-47ca-9ee9-ce67486a83cd
📒 Files selected for processing (1)
openshift-tests/test/e2e/dedicated_hosts.go
♻️ Duplicate comments (1)
openshift-tests/test/e2e/dedicated_hosts.go (1)
907-933: ⚠️ Potential issue | 🟠 Major: Cleanup can stall before deletion when `NodeRef` never appears.
At Line 907, cleanup waits for `Status.NodeRef` before deleting the Machine. If the node never registers, this times out and Line 930 is never reached, leaving the Machine undeleted and potentially blocking host release checks.
Suggested fix
```diff
-	// If nodeName is empty, poll the machine to detect late-registering nodes
-	if nodeName == "" {
-		By(fmt.Sprintf("Node ref not set initially, polling machine %s for late-registering node", machineName))
-		Eventually(func() bool {
-			m, err := client.Machines(machineutil.MachineAPINamespace).Get(ctx, machineName, metav1.GetOptions{})
-			if apierrors.IsNotFound(err) {
-				// Machine is gone, no node registered
-				return true
-			}
-			if err != nil {
-				// Other error, continue polling
-				return false
-			}
-			if m.Status.NodeRef != nil {
-				nodeName = m.Status.NodeRef.Name
-				return true
-			}
-			return false
-		}, defaultTestTimeout, defaultPollingInterval).Should(BeTrue())
-	}
-
 	// Delete the machine
 	By(fmt.Sprintf("Cleaning up test machine %s", machineName))
 	err = client.Machines(machineutil.MachineAPINamespace).Delete(ctx, machineName, metav1.DeleteOptions{})
 	if err != nil && !apierrors.IsNotFound(err) {
 		Expect(err).NotTo(HaveOccurred())
 	}
-	// Wait for machine to be deleted
+	// Wait for machine deletion and capture NodeRef if it appears before final removal.
 	By(fmt.Sprintf("Waiting for machine %s to be deleted", machineName))
 	Eventually(func() bool {
-		_, err := client.Machines(machineutil.MachineAPINamespace).Get(ctx, machineName, metav1.GetOptions{})
-		return apierrors.IsNotFound(err)
+		m, err := client.Machines(machineutil.MachineAPINamespace).Get(ctx, machineName, metav1.GetOptions{})
+		if apierrors.IsNotFound(err) {
+			return true
+		}
+		if err != nil {
+			return false
+		}
+		if nodeName == "" && m.Status.NodeRef != nil {
+			nodeName = m.Status.NodeRef.Name
+		}
+		return false
 	}, defaultTestTimeout, defaultPollingInterval).Should(BeTrue())
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@openshift-tests/test/e2e/dedicated_hosts.go` around lines 907 - 933, The current polling loop that waits for m.Status.NodeRef (when nodeName == "" in the block using client.Machines(...).Get and Eventually) can hang the cleanup if the node never registers; change the logic so the test does not block deletion: perform a bounded/non-blocking poll for NodeRef (use a short timeout or context) and then proceed to call client.Machines(...).Delete for machineName regardless of whether nodeName was discovered, ensuring the Delete is always attempted even if m.Status.NodeRef remained nil.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@openshift-tests/test/e2e/dedicated_hosts.go`:
- Around line 907-933: The current polling loop that waits for m.Status.NodeRef
(when nodeName == "" in the block using client.Machines(...).Get and Eventually)
can hang the cleanup if the node never registers; change the logic so the test
does not block deletion: perform a bounded/non-blocking poll for NodeRef (use a
short timeout or context) and then proceed to call client.Machines(...).Delete
for machineName regardless of whether nodeName was discovered, ensuring the
Delete is always attempted even if m.Status.NodeRef remained nil.
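If this bounded-poll route is preferred over the suggested-fix diff earlier in the thread, a sketch might look like the following; the ten-second interval and two-minute window are illustrative values, not taken from the PR, and `wait` is assumed to be `k8s.io/apimachinery/pkg/util/wait`:

```go
// Sketch only: wait briefly for a late-registering NodeRef, then proceed to
// delete the Machine whether or not a node ever showed up.
if nodeName == "" {
	_ = wait.PollUntilContextTimeout(ctx, 10*time.Second, 2*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			m, err := client.Machines(machineutil.MachineAPINamespace).Get(ctx, machineName, metav1.GetOptions{})
			if apierrors.IsNotFound(err) {
				return true, nil // machine already gone; nothing left to observe
			}
			if err != nil {
				return false, nil // transient error; keep polling within the window
			}
			if m.Status.NodeRef != nil {
				nodeName = m.Status.NodeRef.Name
				return true, nil
			}
			return false, nil
		})
}

// Deletion is attempted regardless of whether nodeName was discovered.
err = client.Machines(machineutil.MachineAPINamespace).Delete(ctx, machineName, metav1.DeleteOptions{})
```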
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 2ab38731-aa57-44fe-b513-16b9dbab39c1
📒 Files selected for processing (1)
openshift-tests/test/e2e/dedicated_hosts.go
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@openshift-tests/test/e2e/dedicated_hosts.go`:
- Around line 539-541: The code calls Fail(fmt.Sprintf(...)) inside the
Eventually check which aborts the test instead of allowing retries; change the
Fail(...) call in the block that checks host state to log the unexpected error
(e.g., using framework.Logf, GinkgoWriter, or t.Logf) and then return false so
Eventually can retry on transient AWS API errors — locate the block referencing
dynamicHostID and replace the Fail(...) invocation with a non-fatal log plus
return false.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 3d903dd0-03d2-480f-bdd1-99bc1ecb4730
📒 Files selected for processing (1)
openshift-tests/test/e2e/dedicated_hosts.go
|
/retest |
|
/lgtm |
|
/retest |
|
@vr4manta: This pull request references SPLAT-2715 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target either version "5.0." or "openshift-5.0.", but it targets "openshift-4.22" instead.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: damdo. The full list of commands accepted by this bot can be found here. The pull request process is described here. Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment. |
|
@vr4manta this looks ready to merge. Feel free to add the verified label once you are happy with the test working as expected. |
|
/verified by @vr4manta |
|
@vr4manta: This PR has been marked as verified by @vr4manta.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
A lot of the failed e2e tests are from sig-node; is there a known issue with the test suite that's causing this? |
|
@vr4manta: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
/cherry-pick release-4.22 |
|
@vr4manta: new pull request created: #184 |
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |