
add event interval from rest client metric broken down by source #27986

Closed

vrutkovs wants to merge 2 commits into openshift:master from vrutkovs:client-view-v2

Conversation

@vrutkovs (Contributor) commented Jun 19, 2023

Make sure the locator and message are shown as-is on the HTML page when rest_ increases.

Test bed for #27976 + fixes

TODO:

  • Metrics are scraped every 30 seconds; is there a way to find out how long disruptions during that period took? It seems possible to derive that from rest_client_request_duration_seconds_sum (a possible query is sketched below).
    Currently we create sequential one-second intervals for each error found, but the exact time and duration of the disruption cannot be properly derived.
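A hedged sketch of such a query, assuming the duration metric is exposed as a histogram with _sum and _count series and is scraped with the same namespace target label as the counter (neither assumption is taken from this PR):

// Illustrative only: mean client request duration per source namespace.
// The grouping label and the 2m range (at least two 30s scrape samples)
// are assumptions, not values from this PR.
package clientviewsketch

const meanClientRequestDurationQuery = `
  sum by (namespace) (rate(rest_client_request_duration_seconds_sum[2m]))
/
  sum by (namespace) (rate(rest_client_request_duration_seconds_count[2m]))
`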

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 19, 2023
openshift-ci bot commented Jun 19, 2023

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@vrutkovs (Contributor, Author)

/test e2e-aws-ovn-single-node-upgrade
/test e2e-aws-ovn-single-node-serial
/test e2e-aws-ovn-upgrade
/test e2e-aws-ovn-single-node
/test e2e-gcp-ovn-upgrade

@dgrisonnet (Member)

@vrutkovs could you perhaps enlighten me as to what is missing from this PR for it to still be in draft?

@vrutkovs (Contributor, Author)

I'm experimenting with query_range to see if we can get per-second disruption data from metrics that are scraped every 30 seconds. Not sure if it's possible at all; I'll add a TODO.
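For context, a minimal sketch of what that query_range experiment could look like with the Prometheus Go client; the address, query, and 30s step are illustrative assumptions, not this PR's code:

package main

import (
    "context"
    "fmt"
    "time"

    "github.com/prometheus/client_golang/api"
    v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
    // Placeholder address; in CI this would come from the cluster instead.
    client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
    if err != nil {
        panic(err)
    }
    promAPI := v1.NewAPI(client)

    end := time.Now()
    r := v1.Range{
        Start: end.Add(-30 * time.Minute),
        End:   end,
        Step:  30 * time.Second, // one sample per scrape window
    }

    // One time series per (namespace, code) pair of failing client requests.
    result, warnings, err := promAPI.QueryRange(context.Background(),
        `sum by (namespace, code) (rest_client_requests_total{code=~"5.."})`, r)
    if err != nil {
        panic(err)
    }
    if len(warnings) > 0 {
        fmt.Println("warnings:", warnings)
    }
    fmt.Println(result)
}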

@vrutkovs vrutkovs changed the title from "add event interval from rest client metric broken down by source" to "WIP add event interval from rest client metric broken down by source" on Jun 21, 2023
@vrutkovs vrutkovs marked this pull request as ready for review June 21, 2023 13:46
@openshift-ci openshift-ci bot requested review from csrwng and deads2k June 21, 2023 13:47
@vrutkovs vrutkovs force-pushed the client-view-v2 branch 3 times, most recently from d64d51d to 2672967, on June 22, 2023 13:53
@vrutkovs vrutkovs changed the title from "WIP add event interval from rest client metric broken down by source" to "add event interval from rest client metric broken down by source" on Jun 28, 2023
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 28, 2023
@vrutkovs (Contributor, Author)

/cc @dinhxuanvu @mfojtik

Let's merge it today so that by mid-next week we'd see whether it gives a valid signal.

@openshift-ci openshift-ci bot requested review from dinhxuanvu and mfojtik June 28, 2023 13:42
@dinhxuanvu (Member)

/retest-required

@vrutkovs vrutkovs force-pushed the client-view-v2 branch 2 times, most recently from 4f14eab to 87baeea, on July 25, 2023 07:46
@dinhxuanvu (Member)

@tkashem PTAL

{
"level": "Info",
"locator": "ns/e2e-kubectl-3271 pod/without-label uid/e185b70c-ea3e-4600-850a-b2370a729a73 container/without-label",
"message": "constructed/pod-lifecycle-constructor reason/ContainerWait missed real \"ContainerWait\"",
Contributor:

question: are these related to this PR?

message := fmt.Sprintf("client observed an API error - %s", series.Metric.String())
intervalsCount := int(current.Value) - int(previous)
if intervalsCount > 1 {
    message = fmt.Sprintf("%s (%d times)", message, intervalsCount)
Contributor:

I didn't see (%d times) in the messages at https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/27986/pull-ci-openshift-origin-master-e2e-aws-ovn-upgrade/1683745727487938560. I tried this int(current.Value) - int(previous) in my original PR and it caused an overflow; the result was negative in some cases, and I did not have time to debug it.

Also, the counter can reset to zero; we need to handle that if we want to display the increment correctly, right?

Contributor (Author):

Oh, okay, I haven't seen the counter being reset (it may happen on an apiserver restart, I guess?). I'll probably revert this commit then.
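One way to address both points above (the negative delta and a possible counter reset) is a reset-aware increment helper along these lines; this is a sketch only, not code from the PR:

package clientviewsketch

// counterIncrease returns the increase between two counter samples, treating
// a drop in value as a counter reset (for example after an apiserver restart)
// instead of producing a negative or wrapped-around delta.
func counterIncrease(previous, current float64) int {
    if current < previous {
        // After a reset the best available estimate, without intermediate
        // samples, is the current value itself.
        return int(current)
    }
    return int(current - previous)
}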

@dinhxuanvu (Member)

/close

openshift-ci bot commented Jul 27, 2023

@dinhxuanvu: Closed this PR.

In response to this:

/close

@openshift-ci openshift-ci bot closed this Jul 27, 2023
@dinhxuanvu (Member)

/reopen

@openshift-ci openshift-ci bot reopened this Jul 27, 2023
openshift-ci bot commented Jul 27, 2023

@dinhxuanvu: Reopened this PR.

In response to this:

/reopen

@dinhxuanvu (Member)

/retest-required

@tkashem (Contributor) commented Aug 7, 2023

@dinhxuanvu (Member)

@tkashem @dgoodwin PTAL. Would like to have this PR merged if there are no further issues.

Locator: interval.Locator,
Message: interval.Message,
Locator: html.EscapeString(interval.Locator),
Message: html.EscapeString(interval.Message),
Contributor:

Suspect this will break things beyond your PR, and even with your intervals it comes out problematic:

"message": "client observed an API error - previous=1 current=2 rest_client_requests_total{code=\u0026#34;500\u0026#34;, endpoint=\u0026#34;https\u0026#34;, host=\u0026#34;172.30.0.1:443\u0026#34;, instance=\u0026#34;10.129.0.40:8443\u0026#34;, job=\u0026#34;metrics\u0026#34;, method=\u0026#34;GET\u0026#34;, namespace=\u0026#34;openshift-console-operator\u0026#34;, pod=\u0026#34;console-operator-7568df6578-d8zrs\u0026#34;, prometheus=\u0026#34;openshift-monitoring/k8s\u0026#34;, service=\u0026#34;metrics\u0026#34;}",

This ceases to be readable, but there is hope. David and I are in the midst of making intervals more structured; we're only part way through, but the features are available for you to start using, and I think in this case they will avoid the mess above. I'll add details on how in the place where you construct your interval.

From: current.Timestamp.Time(),
// TODO: find out how long the requests took using data from rest_client_request_duration_seconds_sum
To: current.Timestamp.Time().Add(time.Second),
}
Contributor:

Per below this is a prime use case for structured intervals:

Currently you're serializing like this:

       {
            "level": "Error",
            "locator": "client/APIError source/service-network node/10.129.0.40 namespace/openshift-console-operator component/openshift-console-operator",
            "message": "client observed an API error - previous=1 current=2 rest_client_requests_total{code=\u0026#34;500\u0026#34;, endpoint=\u0026#34;https\u0026#34;, host=\u0026#34;172.30.0.1:443\u0026#34;, instance=\u0026#34;10.129.0.40:8443\u0026#34;, job=\u0026#34;metrics\u0026#34;, method=\u0026#34;GET\u0026#34;, namespace=\u0026#34;openshift-console-operator\u0026#34;, pod=\u0026#34;console-operator-7568df6578-d8zrs\u0026#34;, prometheus=\u0026#34;openshift-monitoring/k8s\u0026#34;, service=\u0026#34;metrics\u0026#34;}",
            "tempStructuredLocator": {
                "type": "",
                "keys": null
            },
            "tempStructuredMessage": {
                "reason": "",
                "cause": "",
                "humanMessage": "",
                "annotations": null
            },
            "from": "2023-08-07T21:49:25Z",
            "to": "2023-08-07T21:49:26Z"
        },

Intervals can be created with

monitorapi.NewInterval(monitorapi.SourceRESTClientMonitor, monitorapi.Warning).
		Locator(monitorapi.NewLocator().RestAPIError([yourparamsforlocator])).
		Message(msg.HumanMessage("client observed an API error")).Display().Build()

You'll want a locator type for your new client API errors, a locator constructor to create it, and then use message annotations for the previous and current counts.

Is it necessary to include the full promql? If so, maybe put it in its own message annotation, with the escaping done here, but the problem is how do we know when to unescape? Does it serialize OK without the HTML escaping if we put it into a structured field?
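On the last question, a quick standalone check (not part of this PR) suggests encoding/json keeps such a message valid on its own: quotes become \" and, by default, &, <, and > become \uXXXX escapes, and unmarshalling restores the original string exactly, so a structured field would round-trip without html.EscapeString:

package main

import (
    "encoding/json"
    "fmt"
)

func main() {
    // A message resembling the promql series string quoted above.
    msg := `client observed an API error - rest_client_requests_total{code="500", method="GET"}`

    out, err := json.Marshal(map[string]string{"message": msg})
    if err != nil {
        panic(err)
    }
    // Prints valid JSON with \" escapes; json.Unmarshal returns the
    // original string, so no HTML escaping or unescaping is needed.
    fmt.Println(string(out))
}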

}
previous := series.Values[0].Value
for _, current := range series.Values[1:] {
    if !previous.Equal(current.Value) {
Contributor:

Am I correct that if we're experiencing client errors over a prolonged period of time, every metric sample where the counter increments results in its own interval? How far apart are the timestamps on each value? Every 15s?

I'm concerned about bounding here: is a bad job going to generate a million intervals? That would break a bunch of things and likely cost us money too.

For example with disruption we record a single interval for a prolonged period of disruption, we record when we start seeing disruption (and what error message), and then watch for it to stop or for the error message to change. Either event results in the interval now having a To time, so we create it.

IMO you should apply something similar here, such that a prolonged, uninterrupted period of errors is just one interval. Granted, this may get complicated: maybe we have one value where it doesn't change and then it starts going up again; that would make a new interval, but that would be better than nothing. It depends on how far apart the prometheus samples are as well.

Contributor (Author):

"every metric sample where the counter increments results in its own interval?"

No, not exactly. We're going to create one interval per scrape period (30s) if the number of client errors has increased.

"I'm concerned about bounding here: is a bad job going to generate a million intervals?"

No, one interval every 30s at most.

Contributor:

OK, one every 30s, but is that also per time series returned by your promql? (It looks like that might catch a fair bit.) For example, if that promql returns 500 time series and we have a bad job run with a 30-minute problem, that's 500 * 30 * 2, or 30k intervals, which would double the number I'd normally expect.

If the error counter for a time series increments at t, t+30s, t+60s, t+90s, I think it's probably worth the effort to make that one interval as we do for disruption. That would involve tracking start points, watching for a sample that didn't increment and using that as a trigger to terminate that interval and add it to the list, and terminating any intervals that were dangling at the end of the job run.
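A rough sketch of that batching idea, using stand-in types rather than the real Prometheus and monitorapi ones (only the merging logic is meant to carry over):

package clientviewsketch

import "time"

// sample and errorInterval are stand-ins for the real model.SamplePair and
// monitorapi interval types used in the PR.
type sample struct {
    Timestamp time.Time
    Value     float64
}

type errorInterval struct {
    From, To time.Time
}

// mergeIncrements collapses every uninterrupted run of incrementing counter
// samples into a single interval, closing any interval still open when the
// series (i.e. the job run) ends.
func mergeIncrements(samples []sample) []errorInterval {
    var intervals []errorInterval
    var open *errorInterval

    for i := 1; i < len(samples); i++ {
        increased := samples[i].Value > samples[i-1].Value
        switch {
        case increased && open == nil:
            // Errors appeared somewhere between the previous and current scrape.
            open = &errorInterval{From: samples[i-1].Timestamp, To: samples[i].Timestamp}
        case increased:
            open.To = samples[i].Timestamp
        case open != nil:
            intervals = append(intervals, *open)
            open = nil
        }
    }
    if open != nil {
        intervals = append(intervals, *open)
    }
    return intervals
}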

@vrutkovs (Contributor, Author), Aug 9, 2023:

It's per time series, yes, but I don't expect it to be significant. It's "error code" * "source namespace", so in the worst case it's 10 error codes * 50 namespaces * 3 ways to reach the API * (30 min job / 30 sec interval) = 750 intervals for the "every component firing 10 different errors throughout the whole test duration" case.

Contributor:

Unless I'm missing something, 10 error codes x 50 namespaces x 3 API points x 60 (30 minutes at 30s intervals) = 90,000, not 750? I know it's an extreme example, but even 10% of that is a lot more for our JS UI to handle. I think you need to batch based on consecutive failed samples.

Contributor (Author):

Ah, you're correct. Right, this needs interval batching.

Contributor (Author):

Implemented; now sequential intervals are merged into one:

  Aug 23 08:39:20.111: INFO: [client-rest-error-serializer] adding new interval Aug 23 08:36:58.784 - 1s    E client/APIError source/service-network node/10.128.0.197 namespace/openshift-apiserver component/openshift-apiserver client observed an API error - previous=1 current=5 rest_client_requests_total{apiserver="openshift-apiserver", code="503", container="openshift-apiserver", endpoint="https", host="172.30.0.1:443", instance="10.128.0.197:8443", job="api", method="GET", namespace="openshift-apiserver", pod="apiserver-5785c87bf8-9skb4", prometheus="openshift-monitoring/k8s", service="api"}
...
  Aug 23 08:39:20.111: INFO: [client-rest-error-serializer] updated existing interval Aug 23 08:36:58.784 - 30s   E client/APIError source/service-network node/10.128.0.197 namespace/openshift-apiserver component/openshift-apiserver client observed an API error - previous=1 current=5 rest_client_requests_total{apiserver="openshift-apiserver", code="503", container="openshift-apiserver", endpoint="https", host="172.30.0.1:443", instance="10.128.0.197:8443", job="api", method="GET", namespace="openshift-apiserver", pod="apiserver-5785c87bf8-9skb4", prometheus="openshift-monitoring/k8s", service="api"}


func (w *alertSummarySerializer) CollectData(ctx context.Context, storageDir string, beginning, end time.Time) (monitorapi.Intervals, []*junitapi.JUnitTestCase, error) {
    intervals, err := fetchEventIntervalsForAllAlerts(ctx, w.adminRESTConfig, beginning)
    clientEventIntervals, err2 := clientview.FetchEventIntervalsForRestClientError(ctx, w.adminRESTConfig, beginning)
Contributor:

These are not conceptually linked to the alert analyzer tests, right? How hard would it be to break these out into their own InvariantTest? You just need an adminRESTConfig, so it should be pretty easy.

Contributor (Author):

Right, it's best moved into its own InvariantTest.

@vrutkovs vrutkovs force-pushed the client-view-v2 branch 2 times, most recently from 2eeba1d to b815593, on August 9, 2023 11:42
openshift-ci bot commented Aug 21, 2023

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: vrutkovs
Once this PR has been reviewed and has the lgtm label, please assign sjenning for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@vrutkovs vrutkovs force-pushed the client-view-v2 branch 2 times, most recently from 2ecd7cb to 7640d61, on August 21, 2023 19:24
@vrutkovs vrutkovs force-pushed the client-view-v2 branch 3 times, most recently from ffa9a66 to bc0bc03, on August 22, 2023 18:21
openshift-ci bot commented Aug 23, 2023

@vrutkovs: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-ovn-single-node 86d7dea link false /test e2e-aws-ovn-single-node
ci/prow/e2e-metal-ipi-ovn-ipv6 86d7dea link false /test e2e-metal-ipi-ovn-ipv6
ci/prow/e2e-openstack-ovn 86d7dea link false /test e2e-openstack-ovn
ci/prow/e2e-aws-ovn-single-node-upgrade 86d7dea link false /test e2e-aws-ovn-single-node-upgrade

Full PR test history. Your PR dashboard.


@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 22, 2023
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 22, 2023
@openshift-merge-robot

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 22, 2023
@openshift-bot

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci openshift-ci bot closed this Jan 22, 2024
openshift-ci bot commented Jan 22, 2024

@openshift-bot: Closed this PR.

In response to this:

/close


Labels

lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.
needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD.


7 participants