
add event interval from rest client metric broken down by source #27986

Closed

vrutkovs wants to merge 2 commits into openshift:master from vrutkovs:client-view-v2

Conversation

@vrutkovs (Contributor) commented Jun 19, 2023

Make sure the locator and message are shown as-is on the HTML page when rest_ increases.

Test bed for #27976 + fixes

TODO:

  • Metrics are scraped every 30 seconds; is there a way to find out how long disruptions during that period took? It seems possible to derive that from rest_client_request_duration_seconds_sum (a possible query is sketched below).
    Currently we create sequential one-second intervals for each error found, but the exact time and duration of the disruption cannot be properly derived.
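A hedged sketch of such a query, assuming the duration metric is exposed as a histogram with _sum and _count series and is scraped with the same namespace target label as the counter (neither assumption is taken from this PR):

// Illustrative only: mean client request duration per source namespace.
// The grouping label and the 2m range (at least two 30s scrape samples)
// are assumptions, not values from this PR.
package clientviewsketch

const meanClientRequestDurationQuery = `
  sum by (namespace) (rate(rest_client_request_duration_seconds_sum[2m]))
/
  sum by (namespace) (rate(rest_client_request_duration_seconds_count[2m]))
`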

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 19, 2023
openshift-ci bot commented Jun 19, 2023

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@vrutkovs (Contributor, Author)

/test e2e-aws-ovn-single-node-upgrade
/test e2e-aws-ovn-single-node-serial
/test e2e-aws-ovn-upgrade
/test e2e-aws-ovn-single-node
/test e2e-gcp-ovn-upgrade

@dgrisonnet (Member)

@vrutkovs could you perhaps enlighten me as to what is missing from this PR for it to still be in draft?

@vrutkovs (Contributor, Author)

I'm experimenting with query_range to see if we can get per-second disruption data from metrics that are scraped every 30 seconds. Not sure if it's possible at all; I'll add a TODO.
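For context, a minimal sketch of what that query_range experiment could look like with the Prometheus Go client; the address, query, and 30s step are illustrative assumptions, not this PR's code:

package main

import (
    "context"
    "fmt"
    "time"

    "github.com/prometheus/client_golang/api"
    v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
    // Placeholder address; in CI this would come from the cluster instead.
    client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
    if err != nil {
        panic(err)
    }
    promAPI := v1.NewAPI(client)

    end := time.Now()
    r := v1.Range{
        Start: end.Add(-30 * time.Minute),
        End:   end,
        Step:  30 * time.Second, // one sample per scrape window
    }

    // One time series per (namespace, code) pair of failing client requests.
    result, warnings, err := promAPI.QueryRange(context.Background(),
        `sum by (namespace, code) (rest_client_requests_total{code=~"5.."})`, r)
    if err != nil {
        panic(err)
    }
    if len(warnings) > 0 {
        fmt.Println("warnings:", warnings)
    }
    fmt.Println(result)
}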

@vrutkovs vrutkovs changed the title from "add event interval from rest client metric broken down by source" to "WIP add event interval from rest client metric broken down by source" on Jun 21, 2023
@vrutkovs vrutkovs marked this pull request as ready for review June 21, 2023 13:46
@openshift-ci openshift-ci bot requested review from csrwng and deads2k June 21, 2023 13:47
@vrutkovs vrutkovs force-pushed the client-view-v2 branch 3 times, most recently from d64d51d to 2672967, on June 22, 2023 13:53
@vrutkovs vrutkovs changed the title from "WIP add event interval from rest client metric broken down by source" to "add event interval from rest client metric broken down by source" on Jun 28, 2023
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 28, 2023
@vrutkovs (Contributor, Author)

/cc @dinhxuanvu @mfojtik

Let's merge it today so that by mid-next week we'd see whether it gives a valid signal.

@openshift-ci openshift-ci bot requested review from dinhxuanvu and mfojtik June 28, 2023 13:42
@dinhxuanvu (Member)

/retest-required

@vrutkovs vrutkovs force-pushed the client-view-v2 branch 2 times, most recently from 4f14eab to 87baeea, on July 25, 2023 07:46
@dinhxuanvu (Member)

@tkashem PTAL

{
"level": "Info",
"locator": "ns/e2e-kubectl-3271 pod/without-label uid/e185b70c-ea3e-4600-850a-b2370a729a73 container/without-label",
"message": "constructed/pod-lifecycle-constructor reason/ContainerWait missed real \"ContainerWait\"",
Contributor:

question: are these related to this PR?

message := fmt.Sprintf("client observed an API error - %s", series.Metric.String())
intervalsCount := int(current.Value) - int(previous)
if intervalsCount > 1 {
    message = fmt.Sprintf("%s (%d times)", message, intervalsCount)
Contributor:

I didn't see (%d times) in the messages at https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/27986/pull-ci-openshift-origin-master-e2e-aws-ovn-upgrade/1683745727487938560. I tried this int(current.Value) - int(previous) in my original PR and it caused an overflow; the result was negative in some cases, and I did not have time to debug it.

Also, the counter can reset to zero; we need to handle that if we want to display the increment correctly, right?

Contributor (Author):

Oh, okay, I haven't seen the counter being reset (it may happen on an apiserver restart, I guess?). I'll probably revert this commit then.
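One way to address both points above (the negative delta and a possible counter reset) is a reset-aware increment helper along these lines; this is a sketch only, not code from the PR:

package clientviewsketch

// counterIncrease returns the increase between two counter samples, treating
// a drop in value as a counter reset (for example after an apiserver restart)
// instead of producing a negative or wrapped-around delta.
func counterIncrease(previous, current float64) int {
    if current < previous {
        // After a reset the best available estimate, without intermediate
        // samples, is the current value itself.
        return int(current)
    }
    return int(current - previous)
}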

@dinhxuanvu (Member)

/close

openshift-ci bot commented Jul 27, 2023

@dinhxuanvu: Closed this PR.

In response to this:

/close

@openshift-ci openshift-ci bot closed this Jul 27, 2023
@dinhxuanvu (Member)

/reopen

@openshift-ci openshift-ci bot reopened this Jul 27, 2023
openshift-ci bot commented Jul 27, 2023

@dinhxuanvu: Reopened this PR.

In response to this:

/reopen

@dinhxuanvu (Member)

/retest-required

@tkashem (Contributor) commented Aug 7, 2023

@dinhxuanvu (Member)

@tkashem @dgoodwin PTAL. Would like to have this PR merged if there are no further issues.

Locator: interval.Locator,
Message: interval.Message,
Locator: html.EscapeString(interval.Locator),
Message: html.EscapeString(interval.Message),
Contributor:

Suspect this will break things beyond your PR, and even with your intervals it comes out problematic:

"message": "client observed an API error - previous=1 current=2 rest_client_requests_total{code=\u0026#34;500\u0026#34;, endpoint=\u0026#34;https\u0026#34;, host=\u0026#34;172.30.0.1:443\u0026#34;, instance=\u0026#34;10.129.0.40:8443\u0026#34;, job=\u0026#34;metrics\u0026#34;, method=\u0026#34;GET\u0026#34;, namespace=\u0026#34;openshift-console-operator\u0026#34;, pod=\u0026#34;console-operator-7568df6578-d8zrs\u0026#34;, prometheus=\u0026#34;openshift-monitoring/k8s\u0026#34;, service=\u0026#34;metrics\u0026#34;}",

This ceases to be readable, but there is hope. David and I are in the midst of making intervals more structured; we're only part way through, but the features are available for you to start using, and I think in this case they will avoid the mess above. I'll add details on how in the place where you construct your interval.

From: current.Timestamp.Time(),
// TODO: find out how long the requests took using data from rest_client_request_duration_seconds_sum
To: current.Timestamp.Time().Add(time.Second),
}
Contributor:

Per below this is a prime use case for structured intervals:

Currently you're serializing like this:

       {
            "level": "Error",
            "locator": "client/APIError source/service-network node/10.129.0.40 namespace/openshift-console-operator component/openshift-console-operator",
            "message": "client observed an API error - previous=1 current=2 rest_client_requests_total{code=\u0026#34;500\u0026#34;, endpoint=\u0026#34;https\u0026#34;, host=\u0026#34;172.30.0.1:443\u0026#34;, instance=\u0026#34;10.129.0.40:8443\u0026#34;, job=\u0026#34;metrics\u0026#34;, method=\u0026#34;GET\u0026#34;, namespace=\u0026#34;openshift-console-operator\u0026#34;, pod=\u0026#34;console-operator-7568df6578-d8zrs\u0026#34;, prometheus=\u0026#34;openshift-monitoring/k8s\u0026#34;, service=\u0026#34;metrics\u0026#34;}",
            "tempStructuredLocator": {
                "type": "",
                "keys": null
            },
            "tempStructuredMessage": {
                "reason": "",
                "cause": "",
                "humanMessage": "",
                "annotations": null
            },
            "from": "2023-08-07T21:49:25Z",
            "to": "2023-08-07T21:49:26Z"
        },

Intervals can be created with

monitorapi.NewInterval(monitorapi.SourceRESTClientMonitor, monitorapi.Warning).
		Locator(monitorapi.NewLocator().RestAPIError([yourparamsforlocator])).
		Message(msg.HumanMessage("client observed an API error")).Display().Build()

You'll want a locator type for your new client API errors, a locator constructor to create it, and then use message annotations for the previous and current counts.

Is it necessary to include the full promql? If so, maybe put it in its own message annotation, with the escaping done here, but the problem is how do we know when to unescape? Does it serialize OK without the HTML escaping if we put it into a structured field?
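On the last question, a quick standalone check (not part of this PR) suggests encoding/json keeps such a message valid on its own: quotes become \" and, by default, &, <, and > become \uXXXX escapes, and unmarshalling restores the original string exactly, so a structured field would round-trip without html.EscapeString:

package main

import (
    "encoding/json"
    "fmt"
)

func main() {
    // A message resembling the promql series string quoted above.
    msg := `client observed an API error - rest_client_requests_total{code="500", method="GET"}`

    out, err := json.Marshal(map[string]string{"message": msg})
    if err != nil {
        panic(err)
    }
    // Prints valid JSON with \" escapes; json.Unmarshal returns the
    // original string, so no HTML escaping or unescaping is needed.
    fmt.Println(string(out))
}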

}
previous := series.Values[0].Value
for _, current := range series.Values[1:] {
    if !previous.Equal(current.Value) {
Contributor:

Am I correct that if we're experiencing client errors over a prolonged period of time, every metric sample where the counter increments results in its own interval? How far apart are the timestamps on each value? Every 15s?

I'm concerned about bounding here: is a bad job going to generate a million intervals? That would break a bunch of things and likely cost us money too.

For example with disruption we record a single interval for a prolonged period of disruption, we record when we start seeing disruption (and what error message), and then watch for it to stop or for the error message to change. Either event results in the interval now having a To time, so we create it.

IMO you should apply something similar here, such that a prolonged, uninterrupted period of errors is just one interval. Granted, this may get complicated: maybe we have one value where it doesn't change and then it starts going up again; that would make a new interval, but that would be better than nothing. It depends on how far apart the prometheus samples are as well.

Contributor (Author):

"every metric sample where the counter increments results in its own interval?"

No, not exactly. We're going to create one interval per scrape period (30s) if the number of client errors has increased.

"I'm concerned about bounding here: is a bad job going to generate a million intervals?"

No, one interval every 30s at most.

Contributor:

OK, one every 30s, but is that also per time series returned by your promql? (It looks like that might catch a fair bit.) For example, if that promql returns 500 time series and we have a bad job run with a 30-minute problem, that's 500 * 30 * 2, or 30k intervals, which would double the number I'd normally expect.

If the error counter for a time series increments at t, t+30s, t+60s, t+90s, I think it's probably worth the effort to make that one interval as we do for disruption. That would involve tracking start points, watching for a sample that didn't increment and using that as a trigger to terminate that interval and add it to the list, and terminating any intervals that were dangling at the end of the job run.
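A rough sketch of that batching idea, using stand-in types rather than the real Prometheus and monitorapi ones (only the merging logic is meant to carry over):

package clientviewsketch

import "time"

// sample and errorInterval are stand-ins for the real model.SamplePair and
// monitorapi interval types used in the PR.
type sample struct {
    Timestamp time.Time
    Value     float64
}

type errorInterval struct {
    From, To time.Time
}

// mergeIncrements collapses every uninterrupted run of incrementing counter
// samples into a single interval, closing any interval still open when the
// series (i.e. the job run) ends.
func mergeIncrements(samples []sample) []errorInterval {
    var intervals []errorInterval
    var open *errorInterval

    for i := 1; i < len(samples); i++ {
        increased := samples[i].Value > samples[i-1].Value
        switch {
        case increased && open == nil:
            // Errors appeared somewhere between the previous and current scrape.
            open = &errorInterval{From: samples[i-1].Timestamp, To: samples[i].Timestamp}
        case increased:
            open.To = samples[i].Timestamp
        case open != nil:
            intervals = append(intervals, *open)
            open = nil
        }
    }
    if open != nil {
        intervals = append(intervals, *open)
    }
    return intervals
}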

@vrutkovs (Contributor, Author), Aug 9, 2023:

It's per time series, yes, but I don't expect it to be significant. It's "error code" * "source namespace", so in the worst case it's 10 error codes * 50 namespaces * 3 ways to reach the API * (30 min job / 30 sec interval) = 750 intervals for the "every component firing 10 different errors throughout the whole test duration" case.

Contributor:

Unless I'm missing something, 10 error codes x 50 namespaces x 3 API points x 60 (30 minutes at 30s intervals) = 90,000, not 750? I know it's an extreme example, but even 10% of that is a lot more for our JS UI to handle. I think you need to batch based on consecutive failed samples.

Contributor (Author):

Ah, you're correct. Right, this needs interval batching.

Contributor (Author):

Implemented; now sequential intervals are merged into one:

  Aug 23 08:39:20.111: INFO: [client-rest-error-serializer] adding new interval Aug 23 08:36:58.784 - 1s    E client/APIError source/service-network node/10.128.0.197 namespace/openshift-apiserver component/openshift-apiserver client observed an API error - previous=1 current=5 rest_client_requests_total{apiserver="openshift-apiserver", code="503", container="openshift-apiserver", endpoint="https", host="172.30.0.1:443", instance="10.128.0.197:8443", job="api", method="GET", namespace="openshift-apiserver", pod="apiserver-5785c87bf8-9skb4", prometheus="openshift-monitoring/k8s", service="api"}
...
  Aug 23 08:39:20.111: INFO: [client-rest-error-serializer] updated existing interval Aug 23 08:36:58.784 - 30s   E client/APIError source/service-network node/10.128.0.197 namespace/openshift-apiserver component/openshift-apiserver client observed an API error - previous=1 current=5 rest_client_requests_total{apiserver="openshift-apiserver", code="503", container="openshift-apiserver", endpoint="https", host="172.30.0.1:443", instance="10.128.0.197:8443", job="api", method="GET", namespace="openshift-apiserver", pod="apiserver-5785c87bf8-9skb4", prometheus="openshift-monitoring/k8s", service="api"}


func (w *alertSummarySerializer) CollectData(ctx context.Context, storageDir string, beginning, end time.Time) (monitorapi.Intervals, []*junitapi.JUnitTestCase, error) {
    intervals, err := fetchEventIntervalsForAllAlerts(ctx, w.adminRESTConfig, beginning)
    clientEventIntervals, err2 := clientview.FetchEventIntervalsForRestClientError(ctx, w.adminRESTConfig, beginning)
Contributor:

These are not conceptually linked to the alert analyzer tests, right? How hard would it be to break these out into their own InvariantTest? You just need an adminRESTConfig, so it should be pretty easy.

Contributor (Author):

Right, it's best moved into its own InvariantTest.

@vrutkovs vrutkovs force-pushed the client-view-v2 branch 2 times, most recently from 2eeba1d to b815593, on August 9, 2023 11:42
openshift-ci bot commented Aug 21, 2023

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: vrutkovs
Once this PR has been reviewed and has the lgtm label, please assign sjenning for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@vrutkovs vrutkovs force-pushed the client-view-v2 branch 2 times, most recently from 2ecd7cb to 7640d61, on August 21, 2023 19:24
@vrutkovs vrutkovs force-pushed the client-view-v2 branch 3 times, most recently from ffa9a66 to bc0bc03, on August 22, 2023 18:21
openshift-ci bot commented Aug 23, 2023

@vrutkovs: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-ovn-single-node 86d7dea link false /test e2e-aws-ovn-single-node
ci/prow/e2e-metal-ipi-ovn-ipv6 86d7dea link false /test e2e-metal-ipi-ovn-ipv6
ci/prow/e2e-openstack-ovn 86d7dea link false /test e2e-openstack-ovn
ci/prow/e2e-aws-ovn-single-node-upgrade 86d7dea link false /test e2e-aws-ovn-single-node-upgrade

Full PR test history. Your PR dashboard.


@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 22, 2023
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 22, 2023
@openshift-merge-robot

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci openshift-ci bot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 22, 2023
@openshift-bot

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci openshift-ci bot closed this Jan 22, 2024
openshift-ci bot commented Jan 22, 2024

@openshift-bot: Closed this PR.

In response to this:

/close


Labels

lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.
needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD.


7 participants