
OCPBUGS-38859: add api-unreachable-from-client monitor test #29003

Merged
openshift-merge-bot[bot] merged 2 commits into openshift:master from tkashem:client-view on Aug 26, 2024

Conversation

@tkashem
Contributor

@tkashem tkashem commented Aug 13, 2024

This adds a new timeline, api-unreachable, grouped by source: {internal-lb|service-network|external-lb|localhost}. It scrapes the rest_client_requests_total metric:

sum(rate(rest_client_requests_total{code="<error>"}[1m])) by(host)

The number of timelines in the UI is bounded by the number of sources, so it stays small.

The following shows the disruption intervals where clients observed API errors via api-int:

[screenshot: disruption intervals via api-int]
From https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/29003/pull-ci-openshift-origin-master-e2e-aws-ovn-kube-apiserver-rollout/1824519484648460288

By the way, the source webhook.openshift-console-operator.svc:9443 shows a permanent error; it looks like a bad conversion webhook:

2024-08-16T21:24:15.202127907Z W0816 21:24:15.202025      14 reflector.go:547] storage/cacher.go:/console.openshift.io/consoleplugins: failed to list console.openshift.io/v1alpha1, Kind=ConsolePlugin: conversion webhook for console.openshift.io/v1, Kind=ConsolePlugin failed: Post "https://webhook.openshift-console-operator.svc:9443/crdconvert?timeout=30s": service "webhook" not found

[screenshot]

It looks like they removed some references to the bad conversion webhook from their CRD schema [1], but I still see a reference here [2].

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 13, 2024
@openshift-ci openshift-ci bot requested review from deads2k and soltysh August 13, 2024 20:58
@openshift-trt-bot

Job Failure Risk Analysis for sha: 859c24a

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-aws-ovn-serial High
[sig-storage] CSI Volumes [Driver: csi-hostpath] [Testpattern: Dynamic PV (filesystem volmode)] volumeLimits should support volume limits [Serial] [Suite:openshift/conformance/serial] [Suite:k8s]
This test has passed 100.00% of 34 runs on jobs ['periodic-ci-openshift-release-master-ci-4.18-e2e-aws-ovn-serial' 'periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-serial'] in the last 14 days.
---
[sig-api-machinery] OpenAPIV3 should contain OpenAPI V3 for Aggregated APIServer [Serial] [Suite:openshift/conformance/serial] [Suite:k8s]
This test has passed 100.00% of 34 runs on jobs ['periodic-ci-openshift-release-master-ci-4.18-e2e-aws-ovn-serial' 'periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-serial'] in the last 14 days.
---
[sig-node] static pods should start after being created
This test has passed 100.00% of 34 runs on jobs ['periodic-ci-openshift-release-master-ci-4.18-e2e-aws-ovn-serial' 'periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-serial'] in the last 14 days.

Open Bugs
Static pod controller pods sometimes fail to start
etcd recovery test has static pod startup failure

@tkashem tkashem changed the title [WIP] add timeline for client view of API reachability Add 'api-unreachable-from-client' monitor test Aug 16, 2024
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 16, 2024
@tkashem tkashem force-pushed the client-view branch 2 times, most recently from 62869ae to 386882c Compare August 16, 2024 18:51
@openshift-trt-bot

Job Failure Risk Analysis for sha: 386882c

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-upgrade High
[sig-arch] events should not repeat pathologically for ns/openshift-kube-apiserver-operator
This test has passed 99.17% of 120 runs on release 4.18 [Architecture:amd64 FeatureSet:default Installer:ipi Network:ovn NetworkStack:ipv4 Platform:aws SecurityMode:default Topology:single Upgrade:micro] in the last week.
pull-ci-openshift-origin-master-e2e-aws-ovn-serial Medium
[sig-apps] Daemon set [Serial] should surge pods onto nodes when spec was updated and update strategy is RollingUpdate [Suite:openshift/conformance/serial] [Suite:k8s]
This test has passed 97.44% of 39 runs on jobs ['periodic-ci-openshift-release-master-ci-4.18-e2e-aws-ovn-serial' 'periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-serial'] in the last 14 days.

@tkashem
Contributor Author

tkashem commented Aug 17, 2024

Follow up from: #27976, #27986

/cc @dgoodwin @p0lyn0mial @sanchezl @vrutkovs

query: &metrics.PrometheusQueryRunner{
Client: client,
QueryString: `sum(rate(rest_client_requests_total{code="<error>"}[1m])) by(host)`,
Step: time.Minute,
Contributor

Is a minute granular enough to do the debugging necessary for the problems this will expose? We're normally working in seconds, and IIRC our scrape interval is shorter than 1 minute, 15s or 30s I thought.

Contributor Author

I tried to explain this in the godoc:

The intervals are scraped from metrics, so they don't have the same granularity as other intervals, since:
a) in OpenShift, metrics are scraped every 30s
b) for rate to be calculated, we need at least two samples

If an api-unreachable interval overlaps with an apiserver shutdown window, it is typically indicative of network issues at the load balancer layer. Since the intervals are grouped by host, we can also narrow a problem down to a particular host; for example, we have seen cases where connections over the internal load balancer were faulty at times while the service network operated just fine.
Let me know your thoughts.

endTime := end.Timestamp.Time()
if start == end {
// a disruption window with one sample
endTime = end.Timestamp.Time().Add(time.Minute)
Contributor

Per the above, is a minute correct here? More granularity would be good. We typically add a second when we're forcing a point-in-time event to appear in the chart.

Contributor Author

Since we scrape every 30s, a 1s granularity is not very meaningful here, I think. I am approximating the interval to [t-30s ... t+30s] for now. Thoughts?

Contributor

Ah that makes sense, thanks. Fine as is then.

},
analyzer: metrics.RateSeriesAnalyzer{},
builder: &intervalBuilder{
serviceNetworkIP: kubeSvc.Spec.ClusterIP,
Contributor

could we use the default kube service instead? (kubernetes.default.svc)

Contributor Author

Not clear what you mean; I believe we are already using the cluster IP of kubernetes.default.svc.
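The grouping into {internal-lb|service-network|external-lb|localhost} sources could be sketched along these lines. This is a hypothetical illustration (the PR's actual matching rules may differ); serviceNetworkIP would be the ClusterIP of kubernetes.default.svc, as held by the intervalBuilder above:

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// classifySource buckets the host label of rest_client_requests_total
// into one of the four timeline sources.
func classifySource(host, serviceNetworkIP string) string {
	h := host
	if hp, _, err := net.SplitHostPort(host); err == nil {
		h = hp // strip the port, if any
	}
	switch h {
	case "localhost", "127.0.0.1", "::1":
		return "localhost"
	case serviceNetworkIP:
		return "service-network"
	}
	if strings.HasPrefix(h, "api-int.") {
		return "internal-lb"
	}
	return "external-lb"
}

func main() {
	for _, host := range []string{
		"api-int.example.com:6443", "172.30.0.1:443",
		"localhost:6443", "api.example.com:6443",
	} {
		fmt.Println(host, "->", classifySource(host, "172.30.0.1"))
	}
}
```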

return nil, fmt.Errorf("prometheus query %q returned error: %v", q.QueryString, err)
}
if len(warnings) > 0 {
framework.Logf("query %q #### warnings \\n\\t%v\\n\", strings.Join(warningsForQuery, \"\\n\\t\"", q.QueryString, warnings)
Contributor

I think that strings.Join(warningsForQuery should be defined outside of the quotation marks :)

Contributor Author

Hehe, thanks for pointing it out; the copy-paste escaped the quotes and it evaded my eyes :)
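The corrected call presumably moves strings.Join into the argument list. A sketch using fmt.Sprintf to show the formatting (framework.Logf in origin takes the same format-string-plus-args shape):

```go
package main

import (
	"fmt"
	"strings"
)

// formatWarnings builds the log line with strings.Join outside the
// quotation marks, as the review comment suggests.
func formatWarnings(query string, warnings []string) string {
	return fmt.Sprintf("query %q #### warnings\n\t%v", query, strings.Join(warnings, "\n\t"))
}

func main() {
	fmt.Println(formatWarnings(`sum(rate(rest_client_requests_total{code="<error>"}[1m])) by(host)`, []string{"w1", "w2"}))
}
```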

}
to = &current
}
// is the entire range is a disruption?
Contributor

typo, change to is the entire range a disruption?

defer callback.EndSeries()

var from, to *prometheustypes.SamplePair
for _, current := range series.Values {
Contributor

Does the range create the current var once and then reuse it?
If yes, then we cannot simply store &current, as it will point to the same variable.

Contributor Author

@tkashem tkashem Aug 21, 2024

I think the Go team fixed it starting with 1.22, making the loop variable iteration-scoped: https://go.dev/blog/loopvar-preview. That's why the tests passed. I made it an iteration-scoped variable anyway.
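The explicit per-iteration copy mentioned above can be illustrated like this (the shadowing `current := current` is redundant on Go >= 1.22, where the loop variable is already per-iteration, but it is safe on any version):

```go
package main

import "fmt"

// collect stores a pointer to each element, using an explicit
// iteration-scoped copy of the loop variable so that the stored
// pointers never alias a single shared variable.
func collect(values []int) []*int {
	var ptrs []*int
	for _, current := range values {
		current := current // per-iteration copy
		ptrs = append(ptrs, &current)
	}
	return ptrs
}

func main() {
	for _, p := range collect([]int{1, 2, 3}) {
		fmt.Print(*p, " ") // 1 2 3 even on Go < 1.22
	}
	fmt.Println()
}
```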

}

zero := prometheustypes.SampleValue(0)
matrix := result.(prometheustypes.Matrix)
Contributor

Should we check for nil/empty responses?
Why not do a type assertion? (Not sure if line 19 is enough.)

Contributor Author

I think we are fine: we check for nil and then make sure the type is right. prometheustypes.Matrix implements the Value interface:

func (Matrix) Type() ValueType  { return ValMatrix }
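A more defensive variant would use the comma-ok form of the type assertion. A sketch with minimal stand-in types (hypothetical; not the actual prometheus common model packages):

```go
package main

import (
	"errors"
	"fmt"
)

// Value and Matrix are minimal stand-ins for the Prometheus common
// model's model.Value interface and model.Matrix type.
type Value interface{ Type() string }

type Matrix []string

func (Matrix) Type() string { return "matrix" }

// asMatrix uses the comma-ok type assertion so that a nil or unexpected
// result yields an error instead of a panic.
func asMatrix(result Value) (Matrix, error) {
	m, ok := result.(Matrix)
	if !ok {
		return nil, errors.New("expected a matrix result")
	}
	return m, nil
}

func main() {
	m, err := asMatrix(Matrix{"series-a"})
	fmt.Println(len(m), err)
}
```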

}
func (b *intervalBuilder) EndSeries() { b.locator = monitorapi.Locator{} }

func (b *intervalBuilder) NewDisruptionInterval(metric prometheustypes.Metric, start, end *prometheustypes.SamplePair) {
Contributor

Should we check that start and end are not nil?
Same for metric, which is a map, I think?

Contributor Author

I don't think it's necessary; the analyzer invokes NewInterval with a non-nil start and end. But if you want, I can add a nil check.

@tkashem
Contributor Author

tkashem commented Aug 21, 2024

@dgoodwin a new screenshot of the intervals in the UI
[screenshot: intervals in the UI]
taken from https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/29003/pull-ci-openshift-origin-master-e2e-aws-ovn-kube-apiserver-rollout/1826346961381363712

We can see that clients talking to the kube-apiserver through the internal load balancer experience errors, and that this coincides with the apiserver shutdown intervals.

@openshift-ci
Contributor

openshift-ci bot commented Aug 21, 2024

@tkashem: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-ovn-single-node-upgrade 27b70a3 link false /test e2e-aws-ovn-single-node-upgrade
ci/prow/e2e-aws-ovn-single-node 27b70a3 link false /test e2e-aws-ovn-single-node
ci/prow/e2e-aws-ovn-single-node-serial 27b70a3 link false /test e2e-aws-ovn-single-node-serial
ci/prow/e2e-aws-ovn-cgroupsv2 27b70a3 link false /test e2e-aws-ovn-cgroupsv2
ci/prow/e2e-metal-ipi-ovn 27b70a3 link false /test e2e-metal-ipi-ovn
ci/prow/e2e-openstack-ovn 27b70a3 link false /test e2e-openstack-ovn
ci/prow/e2e-aws-ovn-ipsec-serial 27b70a3 link false /test e2e-aws-ovn-ipsec-serial
ci/prow/e2e-aws-csi 27b70a3 link false /test e2e-aws-csi
ci/prow/e2e-aws-ovn-kube-apiserver-rollout 27b70a3 link false /test e2e-aws-ovn-kube-apiserver-rollout

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@dgoodwin
Contributor

This all looks OK to me. You'll need a Jira in the title; I'll approve in case there's more you want to sort out with @p0lyn0mial.

/approve

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 22, 2024
@tkashem tkashem changed the title from "Add 'api-unreachable-from-client' monitor test" to "OCPBUGS-37862: add api-unreachable-from-client monitor test" Aug 22, 2024
@openshift-ci-robot openshift-ci-robot added jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Aug 22, 2024
@openshift-ci-robot

@tkashem: This pull request references Jira Issue OCPBUGS-37862, which is invalid:

  • expected the bug to be open, but it isn't
  • expected the bug to target the "4.18.0" version, but no target version was set
  • expected the bug to be in one of the following states: NEW, ASSIGNED, POST, but it is Closed (Duplicate) instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this: (PR description, quoted above)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@tkashem tkashem changed the title from "OCPBUGS-37862: add api-unreachable-from-client monitor test" to "OCPBUGS-38859: add api-unreachable-from-client monitor test" Aug 22, 2024
@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Aug 22, 2024
@openshift-ci-robot

@tkashem: This pull request references Jira Issue OCPBUGS-38859, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.18.0) matches configured target version for branch (4.18.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @wangke19

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this: (PR description, quoted above)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested a review from wangke19 August 22, 2024 15:03
@tkashem
Contributor Author

tkashem commented Aug 23, 2024

/retest-required

@openshift-trt-bot

Job Failure Risk Analysis for sha: 27b70a3

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-upgrade High
[sig-arch] events should not repeat pathologically for ns/openshift-kube-apiserver-operator
This test has passed 98.18% of 165 runs on release 4.18 [Architecture:amd64 FeatureSet:default Installer:ipi Network:ovn NetworkStack:ipv4 Platform:aws SecurityMode:default Topology:single Upgrade:micro] in the last week.
pull-ci-openshift-origin-master-e2e-aws-ovn-ipsec-serial Medium
[bz-openshift-apiserver] clusteroperator/openshift-apiserver should not change condition/Available
This test has passed 91.97% of 4496 runs on release 4.18 [Overall] in the last week.
pull-ci-openshift-origin-master-e2e-aws-ovn-kube-apiserver-rollout Low
[Conformance][Suite:openshift/kube-apiserver/rollout][Jira:"kube-apiserver"][sig-kube-apiserver] kube-apiserver should roll out new revisions without disruption [apigroup:config.openshift.io][apigroup:operator.openshift.io]
This test has passed 50.00% of 24 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-kube-apiserver-rollout' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-kube-apiserver-rollout'] in the last 14 days.

@p0lyn0mial
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 26, 2024
@openshift-ci
Contributor

openshift-ci bot commented Aug 26, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dgoodwin, p0lyn0mial, tkashem

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tkashem
Contributor Author

tkashem commented Aug 26, 2024

It needs an acknowledge-critical-fixes-only label, cc @dgoodwin

@tkashem
Contributor Author

tkashem commented Aug 26, 2024

/label acknowledge-critical-fixes-only

(It's informational only and does not fail any test; it will help with the load balancer issues we are seeing in CI.)

@openshift-ci openshift-ci bot added the acknowledge-critical-fixes-only Indicates if the issuer of the label is OK with the policy. label Aug 26, 2024
@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 78de9ed and 2 for PR HEAD 27b70a3 in total

@openshift-merge-bot openshift-merge-bot bot merged commit a1615ab into openshift:master Aug 26, 2024
@openshift-ci-robot

@tkashem: Jira Issue OCPBUGS-38859: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-38859 has been moved to the MODIFIED state.

Details

In response to this: (PR description, quoted above)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

Distgit: openshift-enterprise-tests
This PR has been included in build openshift-enterprise-tests-container-v4.18.0-202408261944.p0.ga1615ab.assembly.stream.el9.
All builds following this will include this PR.
