
OCPBUGS-38859: add a test (that flakes) to detect faulty load balancer#29034

Merged
openshift-merge-bot[bot] merged 1 commit into openshift:master from tkashem:faulty-lb
Aug 30, 2024

Conversation

@tkashem
Contributor

@tkashem tkashem commented Aug 26, 2024

No description provided.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 26, 2024
@openshift-ci openshift-ci bot requested review from deads2k and soltysh August 26, 2024 01:26
@tkashem tkashem force-pushed the faulty-lb branch 2 times, most recently from 2799f60 to 8afd3c8 Compare August 26, 2024 13:25
@openshift-trt-bot

Job Failure Risk Analysis for sha: 8afd3c8

Job Name: pull-ci-openshift-origin-master-e2e-metal-ipi-ovn-kube-apiserver-rollout
Failure Risk: Low
  • operator conditions kube-apiserver: This test has passed 68.75% of 16 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-metal-ipi-ovn-kube-apiserver-rollout'] in the last 14 days.
  • [sig-sippy] tests should finish with healthy operators: This test has passed 68.75% of 16 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-metal-ipi-ovn-kube-apiserver-rollout'] in the last 14 days.

@tkashem tkashem changed the title [WIP] Add a test to detect faulty load balancer Add a test to detect faulty load balancer Aug 26, 2024
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 26, 2024
@openshift-trt-bot

Job Failure Risk Analysis for sha: f5ea35e

Job Name: pull-ci-openshift-origin-master-e2e-metal-ipi-ovn-kube-apiserver-rollout
Failure Risk: Low
  • [sig-sippy] tests should finish with healthy operators: This test has passed 70.59% of 17 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-metal-ipi-ovn-kube-apiserver-rollout'] in the last 14 days.
  • operator conditions kube-apiserver: This test has passed 70.59% of 17 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-metal-ipi-ovn-kube-apiserver-rollout'] in the last 14 days.

Contributor

@dgoodwin dgoodwin left a comment


Rest looks good to me.

lbType := unreachable.Condition.Locator.Keys[monitorapi.LocatorAPIUnreachableHostKey]
msg := fmt.Sprintf("client observed connection error, type: %s\nkube-apiserver: %s\n, client: %s\n", lbType, shutdown.String(), unreachable.String())
junit.testCase.FailureOutput.Output = fmt.Sprintf("%s\n%s", junit.testCase.FailureOutput.Output, msg)
}
Contributor


I think this test has to go out in flake only mode, otherwise there's a very good chance it shuts down all payloads. You'll need to refactor a little to return an additional junit testcase with no failure output to trigger it as a flake.

Sippy can then be used to find occurrences where it flaked. Once you know it's fully passing we can remove that.
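(For illustration only: a minimal Go sketch of the flake convention suggested here, using a simplified, hypothetical JUnitTestCase stand-in rather than origin's real junit types; the names and fields below are assumptions, not the PR's actual code. Emitting the same test name twice, once with failure output and once without, lets the run be recorded as a flake instead of a hard failure.)

package main

import "fmt"

// FailureOutput and JUnitTestCase are simplified, hypothetical stand-ins
// for the junit types used by origin; real field names may differ.
type FailureOutput struct {
	Output string
}

type JUnitTestCase struct {
	Name          string
	FailureOutput *FailureOutput
}

// asFlake pairs a failed test case with a passing case of the same name;
// result aggregation treats "same name, one fail plus one pass" as a flake.
func asFlake(failed JUnitTestCase) []JUnitTestCase {
	return []JUnitTestCase{
		failed,              // run #0: carries the failure details
		{Name: failed.Name}, // run #1: no failure output, so the test flakes
	}
}

func main() {
	failed := JUnitTestCase{
		Name:          "[sig-apimachinery] connections should be handled gracefully",
		FailureOutput: &FailureOutput{Output: "client observed connection error during kube-apiserver rollout"},
	}
	for i, tc := range asFlake(failed) {
		fmt.Printf("run #%d failed=%v\n", i, tc.FailureOutput != nil)
	}
}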

Contributor Author


Yes, that makes sense. It already detects a faulty load balancer on AWS; I have added a new junit testcase with no failure output.

@tkashem tkashem changed the title Add a test to detect faulty load balancer OCPBUGS-38859: add a test (that flakes) to detect faulty load balancer Aug 27, 2024
@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Aug 27, 2024
@openshift-ci-robot

@tkashem: This pull request references Jira Issue OCPBUGS-38859, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.18.0) matches configured target version for branch (4.18.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @wangke19

The bug has been updated to refer to the pull request using the external bug tracker.


In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested a review from wangke19 August 27, 2024 15:26
@tkashem
Contributor Author

tkashem commented Aug 27, 2024

/label acknowledge-critical-fixes-only

(It does not fail yet; it only flakes, so we can measure and fix. Once the fixes are made, we can change it to a test that fails.)

@openshift-ci openshift-ci bot added the acknowledge-critical-fixes-only Indicates if the issuer of the label is OK with the policy. label Aug 27, 2024
}
testCases = append(testCases, junit.testCase, flake)
}
return testCases
Contributor


Am I reading this correctly that this is not going to return anything if the test fully passes? From what I can see, on success junit.testCase is nil, because Evaluate is never called. Then we get here, and return an empty slice.

You need to return a success case as well, otherwise your pass rates will be all out of whack.
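(For illustration, reusing the hypothetical JUnitTestCase type from the sketch above: if Evaluate never ran because there were no interesting intervals, the monitor should still return a passing record instead of an empty slice, so pass rates stay accurate. collectTestCases and its parameters are made-up names, not the PR's actual code.)

// collectTestCases is a hypothetical helper, not the PR's code.
func collectTestCases(evaluated *JUnitTestCase, testName string) []JUnitTestCase {
	if evaluated == nil || evaluated.FailureOutput == nil {
		// fully passed: emit an explicit success record instead of nothing,
		// so pass rates computed from the junit results stay accurate
		return []JUnitTestCase{{Name: testName}}
	}
	// failed: pair with a same-name passing case so the run reports as a flake
	return []JUnitTestCase{*evaluated, {Name: testName}}
}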

@tkashem tkashem force-pushed the faulty-lb branch 2 times, most recently from fa053c0 to 8e36b1b Compare August 27, 2024 17:33
@dgoodwin
Contributor

/lgtm

I would just check that you can find passes and fails in the rehearsals once they're in, but it looks good now.

@openshift-ci openshift-ci bot added lgtm Indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Aug 27, 2024
@tkashem
Contributor Author

tkashem commented Aug 27, 2024

/hold
(until we see some passes in rehearsals)

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 27, 2024
@tkashem
Contributor Author

tkashem commented Aug 27, 2024

/retest

@openshift-trt-bot

Job Failure Risk Analysis for sha: 8e36b1b

Job Name: pull-ci-openshift-origin-master-e2e-metal-ipi-ovn-kube-apiserver-rollout
Failure Risk: Low
  • [sig-sippy] tests should finish with healthy operators: This test has passed 70.59% of 17 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-metal-ipi-ovn-kube-apiserver-rollout'] in the last 14 days.
  • operator conditions kube-apiserver: This test has passed 70.59% of 17 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-metal-ipi-ovn-kube-apiserver-rollout'] in the last 14 days.

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Aug 28, 2024
@tkashem
Contributor Author

tkashem commented Aug 28, 2024

/test e2e-metal-ipi-ovn-kube-apiserver-rollout

add test that detects faulty load balancer using the
client metric and the apiserver graceful shutdown interval
@tkashem
Contributor Author

tkashem commented Aug 29, 2024

Passed: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/29034/pull-ci-openshift-origin-master-e2e-metal-ipi-ovn-kube-apiserver-rollout/1828905534703538176

[image]
There is one client error interval, but it does not overlap with any kube-apiserver shutdown interval

junit test output under "Tests Passed":

: [sig-apimachinery] new and reused connections to kube-apiserver should be handled gracefully during the graceful termination process

Skipped: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/29034/pull-ci-openshift-origin-master-e2e-aws-ovn-cgroupsv2/1828905495277080576
There are no kube-apiserver shutdown intervals, and the test log says

I0828 23:15:25.753028 309 monitortest.go:70] monitor[faulty-load-balancer]: found 0 interesting intervals, kube-apiserver shutdown interval count: 0

junit test output:

: [sig-apimachinery] new and reused connections to kube-apiserver should be handled gracefully during the graceful termination process

Reason: No kube-apiserver shutdown interval found

Flake: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/29034/pull-ci-openshift-origin-master-e2e-aws-ovn-kube-apiserver-rollout/1828905505343410176

[image]

monitor tests log:

I0829 00:35:50.831077 296 monitortest.go:70] monitor[faulty-load-balancer]: found 29 interesting intervals, kube-apiserver shutdown interval count: 14

junit output:

: [sig-apimachinery] new and reused connections to kube-apiserver should be handled gracefully during the graceful termination process 
Run #0: Failed 0s
{  
client observed connection error during kube-apiserver rollout, type: internal-lb
kube-apiserver: Aug 28 23:34:42.000 - 131s  I namespace/openshift-kube-apiserver node/ pod/kube-apiserver-ip-10-0-21-158.us-east-2.compute.internal server/kube-apiserver constructed/graceful-shutdown-analyzer reason/GracefulAPIServerShutdown
client: Aug 28 23:35:51.046 - 60s   E host/internal-lb reason/APIUnreachableFromClientMetrics client observed API error(s), host: api-int.ci-op-rtzldrxw-af546.aws-2.ci.openshift.org:6443, duration: 1m0s

client observed connection error during kube-apiserver rollout, type: internal-lb
kube-apiserver: Aug 28 23:38:39.000 - 131s  I namespace/openshift-kube-apiserver node/ pod/kube-apiserver-ip-10-0-123-235.us-east-2.compute.internal server/kube-apiserver constructed/graceful-shutdown-analyzer reason/GracefulAPIServerShutdown
client: Aug 28 23:39:51.046 - 60s   E host/internal-lb reason/APIUnreachableFromClientMetrics client observed API error(s), host: api-int.ci-op-rtzldrxw-af546.aws-2.ci.openshift.org:6443, duration: 1m0s

client observed connection error during kube-apiserver rollout, type: internal-lb
kube-apiserver: Aug 28 23:42:34.000 - 131s  I namespace/openshift-kube-apiserver node/ pod/kube-apiserver-ip-10-0-124-51.us-east-2.compute.internal server/kube-apiserver constructed/graceful-shutdown-analyzer reason/GracefulAPIServerShutdown
client: Aug 28 23:43:51.046 - 60s   E host/internal-lb reason/APIUnreachableFromClientMetrics client observed API error(s), host: api-int.ci-op-rtzldrxw-af546.aws-2.ci.openshift.org:6443, duration: 1m0s

client observed connection error during kube-apiserver rollout, type: internal-lb
kube-apiserver: Aug 28 23:47:15.000 - 131s  I namespace/openshift-kube-apiserver node/ pod/kube-apiserver-ip-10-0-21-158.us-east-2.compute.internal server/kube-apiserver constructed/graceful-shutdown-analyzer reason/GracefulAPIServerShutdown
client: Aug 28 23:47:51.046 - 60s   E host/internal-lb reason/APIUnreachableFromClientMetrics client observed API error(s), host: api-int.ci-op-rtzldrxw-af546.aws-2.ci.openshift.org:6443, duration: 1m0s

client observed connection error during kube-apiserver rollout, type: internal-lb
kube-apiserver: Aug 28 23:51:15.000 - 131s  I namespace/openshift-kube-apiserver node/ pod/kube-apiserver-ip-10-0-123-235.us-east-2.compute.internal server/kube-apiserver constructed/graceful-shutdown-analyzer reason/GracefulAPIServerShutdown
client: Aug 28 23:51:51.046 - 60s   E host/internal-lb reason/APIUnreachableFromClientMetrics client observed API error(s), host: api-int.ci-op-rtzldrxw-af546.aws-2.ci.openshift.org:6443, duration: 1m0s

client observed connection error during kube-apiserver rollout, type: internal-lb
kube-apiserver: Aug 28 23:55:07.000 - 131s  I namespace/openshift-kube-apiserver node/ pod/kube-apiserver-ip-10-0-124-51.us-east-2.compute.internal server/kube-apiserver constructed/graceful-shutdown-analyzer reason/GracefulAPIServerShutdown
client: Aug 28 23:55:51.046 - 60s   E host/internal-lb reason/APIUnreachableFromClientMetrics client observed API error(s), host: api-int.ci-op-rtzldrxw-af546.aws-2.ci.openshift.org:6443, duration: 1m0s

client observed connection error during kube-apiserver rollout, type: internal-lb
kube-apiserver: Aug 28 23:59:51.000 - 131s  I namespace/openshift-kube-apiserver node/ pod/kube-apiserver-ip-10-0-21-158.us-east-2.compute.internal server/kube-apiserver constructed/graceful-shutdown-analyzer reason/GracefulAPIServerShutdown
client: Aug 29 00:00:51.046 - 60s   E host/internal-lb reason/APIUnreachableFromClientMetrics client observed API error(s), host: api-int.ci-op-rtzldrxw-af546.aws-2.ci.openshift.org:6443, duration: 1m0s

client observed connection error during kube-apiserver rollout, type: internal-lb
kube-apiserver: Aug 29 00:03:42.000 - 131s  I namespace/openshift-kube-apiserver node/ pod/kube-apiserver-ip-10-0-123-235.us-east-2.compute.internal server/kube-apiserver constructed/graceful-shutdown-analyzer reason/GracefulAPIServerShutdown
client: Aug 29 00:04:51.046 - 60s   E host/internal-lb reason/APIUnreachableFromClientMetrics client observed API error(s), host: api-int.ci-op-rtzldrxw-af546.aws-2.ci.openshift.org:6443, duration: 1m0s

client observed connection error during kube-apiserver rollout, type: internal-lb
kube-apiserver: Aug 29 00:07:40.000 - 131s  I namespace/openshift-kube-apiserver node/ pod/kube-apiserver-ip-10-0-124-51.us-east-2.compute.internal server/kube-apiserver constructed/graceful-shutdown-analyzer reason/GracefulAPIServerShutdown
client: Aug 29 00:08:51.046 - 60s   E host/internal-lb reason/APIUnreachableFromClientMetrics client observed API error(s), host: api-int.ci-op-rtzldrxw-af546.aws-2.ci.openshift.org:6443, duration: 1m0s

client observed connection error during kube-apiserver rollout, type: internal-lb
kube-apiserver: Aug 29 00:12:33.000 - 132s  I namespace/openshift-kube-apiserver node/ pod/kube-apiserver-ip-10-0-21-158.us-east-2.compute.internal server/kube-apiserver constructed/graceful-shutdown-analyzer reason/GracefulAPIServerShutdown
client: Aug 29 00:13:51.046 - 60s   E host/internal-lb reason/APIUnreachableFromClientMetrics client observed API error(s), host: api-int.ci-op-rtzldrxw-af546.aws-2.ci.openshift.org:6443, duration: 1m0s

client observed connection error during kube-apiserver rollout, type: internal-lb
kube-apiserver: Aug 29 00:16:32.000 - 131s  I namespace/openshift-kube-apiserver node/ pod/kube-apiserver-ip-10-0-123-235.us-east-2.compute.internal server/kube-apiserver constructed/graceful-shutdown-analyzer reason/GracefulAPIServerShutdown
client: Aug 29 00:17:51.046 - 60s   E host/internal-lb reason/APIUnreachableFromClientMetrics client observed API error(s), host: api-int.ci-op-rtzldrxw-af546.aws-2.ci.openshift.org:6443, duration: 1m0s

client observed connection error during kube-apiserver rollout, type: internal-lb
kube-apiserver: Aug 29 00:20:27.000 - 131s  I namespace/openshift-kube-apiserver node/ pod/kube-apiserver-ip-10-0-124-51.us-east-2.compute.internal server/kube-apiserver constructed/graceful-shutdown-analyzer reason/GracefulAPIServerShutdown
client: Aug 29 00:21:21.046 - 60s   E host/internal-lb reason/APIUnreachableFromClientMetrics client observed API error(s), host: api-int.ci-op-rtzldrxw-af546.aws-2.ci.openshift.org:6443, duration: 1m0s

client observed connection error during kube-apiserver rollout, type: internal-lb
kube-apiserver: Aug 29 00:25:19.000 - 131s  I namespace/openshift-kube-apiserver node/ pod/kube-apiserver-ip-10-0-21-158.us-east-2.compute.internal server/kube-apiserver constructed/graceful-shutdown-analyzer reason/GracefulAPIServerShutdown
client: Aug 29 00:26:21.046 - 60s   E host/internal-lb reason/APIUnreachableFromClientMetrics client observed API error(s), host: api-int.ci-op-rtzldrxw-af546.aws-2.ci.openshift.org:6443, duration: 1m0s

client observed connection error during kube-apiserver rollout, type: internal-lb
kube-apiserver: Aug 29 00:29:21.000 - 131s  I namespace/openshift-kube-apiserver node/ pod/kube-apiserver-ip-10-0-123-235.us-east-2.compute.internal server/kube-apiserver constructed/graceful-shutdown-analyzer reason/GracefulAPIServerShutdown
client: Aug 29 00:30:21.046 - 60s   E host/internal-lb reason/APIUnreachableFromClientMetrics client observed API error(s), host: api-int.ci-op-rtzldrxw-af546.aws-2.ci.openshift.org:6443, duration: 1m0s
}
Run #1: Passed 

// b) we find at least one valid kube-apiserver shutdown interval, but no
// overlapping client error interval, this test is a pass
// c) we find at least one valid kube-apiserver shutdown interval, and at
// least one overlapping client error interval, this test is a flake
Contributor Author


@dgoodwin I revised the junit test output for Pass, Skip, and Flake; let me know your thoughts. Examples are here: #29034 (comment)

Contributor


Cool, I don't recall seeing a skip in a monitortest yet.
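(To tie together the Pass, Skip, and Flake outcomes discussed in this thread, here is a rough, self-contained Go sketch; the Interval type and evaluate function are hypothetical stand-ins, not the PR's actual monitor code. No kube-apiserver shutdown intervals yields a skip, shutdown intervals with no overlapping client error interval yield a pass, and any overlap yields a flake.)

package main

import (
	"fmt"
	"time"
)

// Interval is a hypothetical stand-in for the monitor's interval type.
type Interval struct {
	From, To time.Time
}

// overlaps reports whether two intervals intersect.
func (a Interval) overlaps(b Interval) bool {
	return a.From.Before(b.To) && b.From.Before(a.To)
}

// evaluate mirrors the three outcomes discussed in this PR:
// no shutdown intervals -> skip; shutdown intervals with no overlapping
// client error interval -> pass; any overlap -> flake.
func evaluate(shutdowns, clientErrors []Interval) string {
	if len(shutdowns) == 0 {
		return "skip" // e.g. the e2e-aws-ovn-cgroupsv2 run above
	}
	for _, s := range shutdowns {
		for _, c := range clientErrors {
			if s.overlaps(c) {
				return "flake" // a client saw API errors during a graceful shutdown
			}
		}
	}
	return "pass" // shutdowns happened, but clients never noticed
}

func main() {
	now := time.Now()
	shutdown := Interval{From: now, To: now.Add(131 * time.Second)}
	clientErr := Interval{From: now.Add(60 * time.Second), To: now.Add(120 * time.Second)}
	fmt.Println(evaluate([]Interval{shutdown}, []Interval{clientErr})) // prints "flake"
}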

@tkashem
Contributor Author

tkashem commented Aug 29, 2024

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 29, 2024
@tkashem
Contributor Author

tkashem commented Aug 29, 2024

/retest-required

@dgoodwin
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 29, 2024
@openshift-ci
Contributor

openshift-ci bot commented Aug 29, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dgoodwin, sanchezl, tkashem

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 458e1ea and 2 for PR HEAD cbc62c5 in total

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD a993c78 and 2 for PR HEAD cbc62c5 in total

1 similar comment

@tkashem
Contributor Author

tkashem commented Aug 30, 2024

/retest-required

@openshift-ci
Contributor

openshift-ci bot commented Aug 30, 2024

@tkashem: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-ovn-single-node cbc62c5 link false /test e2e-aws-ovn-single-node
ci/prow/e2e-aws-ovn-ipsec-serial cbc62c5 link false /test e2e-aws-ovn-ipsec-serial
ci/prow/e2e-aws-ovn-single-node-upgrade cbc62c5 link false /test e2e-aws-ovn-single-node-upgrade
ci/prow/e2e-gcp-ovn-rt-upgrade cbc62c5 link false /test e2e-gcp-ovn-rt-upgrade
ci/prow/e2e-aws-ovn-upgrade cbc62c5 link false /test e2e-aws-ovn-upgrade
ci/prow/e2e-aws-ovn-cgroupsv2 cbc62c5 link false /test e2e-aws-ovn-cgroupsv2

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@openshift-trt-bot

Job Failure Risk Analysis for sha: cbc62c5

Job Name: pull-ci-openshift-origin-master-e2e-aws-ovn-ipsec-serial
Failure Risk: High
  • [sig-arch] events should not repeat pathologically for ns/openshift-authentication-operator: This test has passed 100.00% of 34 runs on jobs ['periodic-ci-openshift-release-master-ci-4.18-e2e-aws-ovn-serial' 'periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-serial'] in the last 14 days.
  • [bz-Monitoring] clusteroperator/monitoring should not change condition/Available: This test has passed 100.00% of 34 runs on jobs ['periodic-ci-openshift-release-master-ci-4.18-e2e-aws-ovn-serial' 'periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-serial'] in the last 14 days.

Open Bugs
monitoring ClusterOperator should not blip Available=Unknown on client rate limiter

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 8d619a5 and 2 for PR HEAD cbc62c5 in total

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD f1ade57 and 2 for PR HEAD cbc62c5 in total

@openshift-merge-bot openshift-merge-bot bot merged commit 1ce76da into openshift:master Aug 30, 2024
@openshift-ci-robot

@tkashem: Jira Issue OCPBUGS-38859: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-38859 has been moved to the MODIFIED state.


In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

Distgit: openshift-enterprise-tests
This PR has been included in build openshift-enterprise-tests-container-v4.18.0-202408301641.p0.g1ce76da.assembly.stream.el9.
All builds following this will include this PR.


Labels

  • acknowledge-critical-fixes-only: Indicates if the issuer of the label is OK with the policy.
  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • lgtm: Indicates that a PR is ready to be merged.
