
OCPBUGS-38859: add api-unreachable-from-client monitor test #29003

Merged
openshift-merge-bot[bot] merged 2 commits into openshift:master from tkashem:client-view on Aug 26, 2024

Conversation

@tkashem
Contributor

@tkashem tkashem commented Aug 13, 2024

This adds a new timeline, api-unreachable, grouped by source: {internal-lb|service-network|external-lb|localhost}. It scrapes the rest_client_requests_total metric:

sum(rate(rest_client_requests_total{code="<error>"}[1m])) by(host)

The number of timelines in the UI is bounded by the number of sources, so it stays small.

The following shows the disruption intervals where clients observed API errors via api-int:

[screenshot: disruption intervals via api-int]
From https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/29003/pull-ci-openshift-origin-master-e2e-aws-ovn-kube-apiserver-rollout/1824519484648460288

By the way, the source webhook.openshift-console-operator.svc:9443 shows a permanent error; it looks like a bad conversion webhook:

2024-08-16T21:24:15.202127907Z W0816 21:24:15.202025      14 reflector.go:547] storage/cacher.go:/console.openshift.io/consoleplugins: failed to list console.openshift.io/v1alpha1, Kind=ConsolePlugin: conversion webhook for console.openshift.io/v1, Kind=ConsolePlugin failed: Post "https://webhook.openshift-console-operator.svc:9443/crdconvert?timeout=30s": service "webhook" not found

[screenshot]

It looks like they removed some references to the bad conversion webhook from their CRD schema [1], but I still see a reference here [2].

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 13, 2024
@openshift-ci openshift-ci bot requested review from deads2k and soltysh August 13, 2024 20:58
@openshift-trt-bot

Job Failure Risk Analysis for sha: 859c24a

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-aws-ovn-serial High
[sig-storage] CSI Volumes [Driver: csi-hostpath] [Testpattern: Dynamic PV (filesystem volmode)] volumeLimits should support volume limits [Serial] [Suite:openshift/conformance/serial] [Suite:k8s]
This test has passed 100.00% of 34 runs on jobs ['periodic-ci-openshift-release-master-ci-4.18-e2e-aws-ovn-serial' 'periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-serial'] in the last 14 days.
---
[sig-api-machinery] OpenAPIV3 should contain OpenAPI V3 for Aggregated APIServer [Serial] [Suite:openshift/conformance/serial] [Suite:k8s]
This test has passed 100.00% of 34 runs on jobs ['periodic-ci-openshift-release-master-ci-4.18-e2e-aws-ovn-serial' 'periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-serial'] in the last 14 days.
---
[sig-node] static pods should start after being created
This test has passed 100.00% of 34 runs on jobs ['periodic-ci-openshift-release-master-ci-4.18-e2e-aws-ovn-serial' 'periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-serial'] in the last 14 days.

Open Bugs
Static pod controller pods sometimes fail to start
etcd recovery test has static pod startup failure

@tkashem tkashem changed the title [WIP] add timeline for client view of API reachability Add 'api-unreachable-from-client' monitor test Aug 16, 2024
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 16, 2024
@tkashem tkashem force-pushed the client-view branch 2 times, most recently from 62869ae to 386882c Compare August 16, 2024 18:51
@openshift-trt-bot

Job Failure Risk Analysis for sha: 386882c

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-upgrade High
[sig-arch] events should not repeat pathologically for ns/openshift-kube-apiserver-operator
This test has passed 99.17% of 120 runs on release 4.18 [Architecture:amd64 FeatureSet:default Installer:ipi Network:ovn NetworkStack:ipv4 Platform:aws SecurityMode:default Topology:single Upgrade:micro] in the last week.
pull-ci-openshift-origin-master-e2e-aws-ovn-serial Medium
[sig-apps] Daemon set [Serial] should surge pods onto nodes when spec was updated and update strategy is RollingUpdate [Suite:openshift/conformance/serial] [Suite:k8s]
This test has passed 97.44% of 39 runs on jobs ['periodic-ci-openshift-release-master-ci-4.18-e2e-aws-ovn-serial' 'periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-serial'] in the last 14 days.

@tkashem
Contributor Author

tkashem commented Aug 17, 2024

Follow up from: #27976, #27986

/cc @dgoodwin @p0lyn0mial @sanchezl @vrutkovs

query: &metrics.PrometheusQueryRunner{
Client: client,
QueryString: `sum(rate(rest_client_requests_total{code="<error>"}[1m])) by(host)`,
Step: time.Minute,
Contributor

Is a minute granular enough to do the debugging necessary for the problems this will expose? We're normally working in seconds, and IIRC our scrape interval is shorter than 1 minute, 15s or 30s I thought.

Contributor Author

I tried to explain this in the godoc:

The intervals are scraped from metrics, so they don't have the same granularity as other intervals, since:
a) in OpenShift, metrics are scraped every 30s
b) for rate to be calculated, we need at least two samples

If an api-unreachable interval overlaps with an apiserver shutdown window, it is typically indicative of network issues at the load balancer layer. Since the intervals are grouped by host, we can also narrow a problem down to a particular host; for example, we have seen cases where connections over the internal load balancer were faulty at times while the service network operated just fine.
Let me know your thoughts.

endTime := end.Timestamp.Time()
if start == end {
// a disruption window with one sample
endTime = end.Timestamp.Time().Add(time.Minute)
Contributor

Per the above, is a minute correct here? More granularity would be good. We typically add a second when we're forcing a point-in-time event to appear in the chart.

Contributor Author

Since we scrape every 30s, a 1s granularity is not very meaningful here, I think. I am approximating the interval to [t-30s ... t+30s] for now. Thoughts?

Contributor

Ah that makes sense, thanks. Fine as is then.

},
analyzer: metrics.RateSeriesAnalyzer{},
builder: &intervalBuilder{
serviceNetworkIP: kubeSvc.Spec.ClusterIP,
Contributor

could we use the default kube service instead? (kubernetes.default.svc)

Contributor Author

Not clear what you mean; I believe we are already using the cluster IP of kubernetes.default.svc.
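The grouping into {internal-lb|service-network|external-lb|localhost} sources could be sketched along these lines. This is a hypothetical illustration (the PR's actual matching rules may differ); serviceNetworkIP would be the ClusterIP of kubernetes.default.svc, as held by the intervalBuilder above:

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// classifySource buckets the host label of rest_client_requests_total
// into one of the four timeline sources.
func classifySource(host, serviceNetworkIP string) string {
	h := host
	if hp, _, err := net.SplitHostPort(host); err == nil {
		h = hp // strip the port, if any
	}
	switch h {
	case "localhost", "127.0.0.1", "::1":
		return "localhost"
	case serviceNetworkIP:
		return "service-network"
	}
	if strings.HasPrefix(h, "api-int.") {
		return "internal-lb"
	}
	return "external-lb"
}

func main() {
	for _, host := range []string{
		"api-int.example.com:6443", "172.30.0.1:443",
		"localhost:6443", "api.example.com:6443",
	} {
		fmt.Println(host, "->", classifySource(host, "172.30.0.1"))
	}
}
```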

return nil, fmt.Errorf("prometheus query %q returned error: %v", q.QueryString, err)
}
if len(warnings) > 0 {
framework.Logf("query %q #### warnings \\n\\t%v\\n\", strings.Join(warningsForQuery, \"\\n\\t\"", q.QueryString, warnings)
Contributor

I think that strings.Join(warningsForQuery should be defined outside of the quotation marks :)

Contributor Author

Hehe, thanks for pointing it out; the copy-paste escaped the quotes and it evaded my eyes :)
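The corrected call presumably moves strings.Join into the argument list. A sketch using fmt.Sprintf to show the formatting (framework.Logf in origin takes the same format-string-plus-args shape):

```go
package main

import (
	"fmt"
	"strings"
)

// formatWarnings builds the log line with strings.Join outside the
// quotation marks, as the review comment suggests.
func formatWarnings(query string, warnings []string) string {
	return fmt.Sprintf("query %q #### warnings\n\t%v", query, strings.Join(warnings, "\n\t"))
}

func main() {
	fmt.Println(formatWarnings(`sum(rate(rest_client_requests_total{code="<error>"}[1m])) by(host)`, []string{"w1", "w2"}))
}
```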

}
to = &current
}
// is the entire range is a disruption?
Contributor

typo, change to is the entire range a disruption?

defer callback.EndSeries()

var from, to *prometheustypes.SamplePair
for _, current := range series.Values {
Contributor

Does the range create the current var once and then reuse it?
If yes, then we cannot simply store &current, as it will point to the same variable.

Contributor Author

@tkashem tkashem Aug 21, 2024

I think the Go team fixed it starting with 1.22, making the loop variable iteration-scoped: https://go.dev/blog/loopvar-preview. That's why the tests passed. I made it an iteration-scoped variable anyway.
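The explicit per-iteration copy mentioned above can be illustrated like this (the shadowing `current := current` is redundant on Go >= 1.22, where the loop variable is already per-iteration, but it is safe on any version):

```go
package main

import "fmt"

// collect stores a pointer to each element, using an explicit
// iteration-scoped copy of the loop variable so that the stored
// pointers never alias a single shared variable.
func collect(values []int) []*int {
	var ptrs []*int
	for _, current := range values {
		current := current // per-iteration copy
		ptrs = append(ptrs, &current)
	}
	return ptrs
}

func main() {
	for _, p := range collect([]int{1, 2, 3}) {
		fmt.Print(*p, " ") // 1 2 3 even on Go < 1.22
	}
	fmt.Println()
}
```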

}

zero := prometheustypes.SampleValue(0)
matrix := result.(prometheustypes.Matrix)
Contributor

Should we check for nil/empty responses?
Why not do a type assertion? (Not sure if line 19 is enough.)

Contributor Author

I think we are fine: we check for nil and then make sure the type is right. prometheustypes.Matrix implements the Value interface:

func (Matrix) Type() ValueType  { return ValMatrix }
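A more defensive variant would use the comma-ok form of the type assertion. A sketch with minimal stand-in types (hypothetical; not the actual prometheus common model packages):

```go
package main

import (
	"errors"
	"fmt"
)

// Value and Matrix are minimal stand-ins for the Prometheus common
// model's model.Value interface and model.Matrix type.
type Value interface{ Type() string }

type Matrix []string

func (Matrix) Type() string { return "matrix" }

// asMatrix uses the comma-ok type assertion so that a nil or unexpected
// result yields an error instead of a panic.
func asMatrix(result Value) (Matrix, error) {
	m, ok := result.(Matrix)
	if !ok {
		return nil, errors.New("expected a matrix result")
	}
	return m, nil
}

func main() {
	m, err := asMatrix(Matrix{"series-a"})
	fmt.Println(len(m), err)
}
```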

}
func (b *intervalBuilder) EndSeries() { b.locator = monitorapi.Locator{} }

func (b *intervalBuilder) NewDisruptionInterval(metric prometheustypes.Metric, start, end *prometheustypes.SamplePair) {
Contributor

Should we check that start and end are not nil?
Same for metric, which is a map, I think?

Contributor Author

I don't think it's necessary; the analyzer invokes NewInterval with a non-nil start and end. But if you want, I can add a nil check.

@tkashem
Contributor Author

tkashem commented Aug 21, 2024

@dgoodwin a new screenshot of the intervals in the UI
[screenshot: intervals in the UI]
taken from https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/29003/pull-ci-openshift-origin-master-e2e-aws-ovn-kube-apiserver-rollout/1826346961381363712

We can see that clients talking to the kube-apiserver through the internal load balancer experience errors, and that this coincides with the apiserver shutdown intervals.

@openshift-ci
Contributor

openshift-ci bot commented Aug 21, 2024

@tkashem: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-ovn-single-node-upgrade 27b70a3 link false /test e2e-aws-ovn-single-node-upgrade
ci/prow/e2e-aws-ovn-single-node 27b70a3 link false /test e2e-aws-ovn-single-node
ci/prow/e2e-aws-ovn-single-node-serial 27b70a3 link false /test e2e-aws-ovn-single-node-serial
ci/prow/e2e-aws-ovn-cgroupsv2 27b70a3 link false /test e2e-aws-ovn-cgroupsv2
ci/prow/e2e-metal-ipi-ovn 27b70a3 link false /test e2e-metal-ipi-ovn
ci/prow/e2e-openstack-ovn 27b70a3 link false /test e2e-openstack-ovn
ci/prow/e2e-aws-ovn-ipsec-serial 27b70a3 link false /test e2e-aws-ovn-ipsec-serial
ci/prow/e2e-aws-csi 27b70a3 link false /test e2e-aws-csi
ci/prow/e2e-aws-ovn-kube-apiserver-rollout 27b70a3 link false /test e2e-aws-ovn-kube-apiserver-rollout

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@dgoodwin
Contributor

This all looks OK to me. You'll need a Jira in the title; I'll approve in case there's more you want to sort out with @p0lyn0mial.

/approve

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 22, 2024
@tkashem tkashem changed the title from "Add 'api-unreachable-from-client' monitor test" to "OCPBUGS-37862: add api-unreachable-from-client monitor test" Aug 22, 2024
@openshift-ci-robot openshift-ci-robot added jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Aug 22, 2024
@openshift-ci-robot

@tkashem: This pull request references Jira Issue OCPBUGS-37862, which is invalid:

  • expected the bug to be open, but it isn't
  • expected the bug to target the "4.18.0" version, but no target version was set
  • expected the bug to be in one of the following states: NEW, ASSIGNED, POST, but it is Closed (Duplicate) instead

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this: (PR description, quoted above)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@tkashem tkashem changed the title from "OCPBUGS-37862: add api-unreachable-from-client monitor test" to "OCPBUGS-38859: add api-unreachable-from-client monitor test" Aug 22, 2024
@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Aug 22, 2024
@openshift-ci-robot

@tkashem: This pull request references Jira Issue OCPBUGS-38859, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.18.0) matches configured target version for branch (4.18.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @wangke19

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this: (PR description, quoted above)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested a review from wangke19 August 22, 2024 15:03
@tkashem
Contributor Author

tkashem commented Aug 23, 2024

/retest-required

@openshift-trt-bot

Job Failure Risk Analysis for sha: 27b70a3

Job Name Failure Risk
pull-ci-openshift-origin-master-e2e-aws-ovn-single-node-upgrade High
[sig-arch] events should not repeat pathologically for ns/openshift-kube-apiserver-operator
This test has passed 98.18% of 165 runs on release 4.18 [Architecture:amd64 FeatureSet:default Installer:ipi Network:ovn NetworkStack:ipv4 Platform:aws SecurityMode:default Topology:single Upgrade:micro] in the last week.
pull-ci-openshift-origin-master-e2e-aws-ovn-ipsec-serial Medium
[bz-openshift-apiserver] clusteroperator/openshift-apiserver should not change condition/Available
This test has passed 91.97% of 4496 runs on release 4.18 [Overall] in the last week.
pull-ci-openshift-origin-master-e2e-aws-ovn-kube-apiserver-rollout Low
[Conformance][Suite:openshift/kube-apiserver/rollout][Jira:"kube-apiserver"][sig-kube-apiserver] kube-apiserver should roll out new revisions without disruption [apigroup:config.openshift.io][apigroup:operator.openshift.io]
This test has passed 50.00% of 24 runs on jobs ['periodic-ci-openshift-release-master-nightly-4.18-e2e-aws-ovn-kube-apiserver-rollout' 'periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-kube-apiserver-rollout'] in the last 14 days.

@p0lyn0mial
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 26, 2024
@openshift-ci
Contributor

openshift-ci bot commented Aug 26, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dgoodwin, p0lyn0mial, tkashem

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tkashem
Contributor Author

tkashem commented Aug 26, 2024

It needs an acknowledge-critical-fixes-only label, cc @dgoodwin

@tkashem
Contributor Author

tkashem commented Aug 26, 2024

/label acknowledge-critical-fixes-only

(It's informational only and does not fail any test; it will help with the load balancer issues we are seeing in CI.)

@openshift-ci openshift-ci bot added the acknowledge-critical-fixes-only Indicates if the issuer of the label is OK with the policy. label Aug 26, 2024
@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD 78de9ed and 2 for PR HEAD 27b70a3 in total

@openshift-merge-bot openshift-merge-bot bot merged commit a1615ab into openshift:master Aug 26, 2024
@openshift-ci-robot

@tkashem: Jira Issue OCPBUGS-38859: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-38859 has been moved to the MODIFIED state.

Details

In response to this: (PR description, quoted above)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

Distgit: openshift-enterprise-tests
This PR has been included in build openshift-enterprise-tests-container-v4.18.0-202408261944.p0.ga1615ab.assembly.stream.el9.
All builds following this will include this PR.
