add event interval from rest client metric broken down by source #27986
vrutkovs wants to merge 2 commits into openshift:master
Conversation
|
Skipping CI for Draft Pull Request. |
|
/test e2e-aws-ovn-single-node-upgrade |
|
@vrutkovs could you perhaps enlighten me as to what is missing from this PR for it to still be in draft? |
|
I'm experimenting with query_range to see if we can get disruptions for each second from metrics, which are scraped every 30 sec. Not sure if it's possible at all; I'll add a TODO |
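For reference, a minimal sketch of the kind of query_range call being experimented with, using the Prometheus Go client. The address, PromQL expression, time range and step below are assumptions for illustration, not the query this PR ends up using:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Assumed endpoint; the real gatherer authenticates against the in-cluster Prometheus.
	client, err := api.NewClient(api.Config{Address: "https://prometheus.example.com"})
	if err != nil {
		panic(err)
	}
	promAPI := v1.NewAPI(client)

	// Ask for the raw error counter over the run, sampled at the scrape interval (30s),
	// broken down by the source labels (code, namespace, ...).
	query := `rest_client_requests_total{code=~"5.."}`
	rng := v1.Range{
		Start: time.Now().Add(-30 * time.Minute),
		End:   time.Now(),
		Step:  30 * time.Second,
	}
	result, warnings, err := promAPI.QueryRange(context.TODO(), query, rng)
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println(result)
}
```

With a 30s step matching the scrape interval, each returned series carries one sample per scrape, which is the granularity the rest of the discussion below revolves around.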
d64d51d to
2672967
Compare
|
/cc @dinhxuanvu @mfojtik Let's merge it today so that by mid-next week we'd see if that gives a valid signal |
|
/retest-required |
4f14eab to
87baeea
Compare
|
@tkashem PTAL |
{
  "level": "Info",
  "locator": "ns/e2e-kubectl-3271 pod/without-label uid/e185b70c-ea3e-4600-850a-b2370a729a73 container/without-label",
  "message": "constructed/pod-lifecycle-constructor reason/ContainerWait missed real \"ContainerWait\"",
question: are these related to this PR?
pkg/monitor/clientview/gatherer.go
Outdated
message := fmt.Sprintf("client observed an API error - %s", series.Metric.String())
intervalsCount := int(current.Value) - int(previous)
if intervalsCount > 1 {
	message = fmt.Sprintf("%s (%d times)", message, intervalsCount)
I didn't see (%d times) in the messages https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/27986/pull-ci-openshift-origin-master-e2e-aws-ovn-upgrade/1683745727487938560. I tried this int(current.Value) - int(previous) in my original PR and it caused an overflow; the result was negative in some cases, and I did not have time to debug it.
Also, the counter can reset to zero; we need to handle that if we want to display the increment correctly, right?
Oh, okay, I haven't seen the counter being reset (it may happen on an apiserver restart, I guess?). I'll probably revert this commit then.
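For what it's worth, a reset-aware delta is the usual way to handle this; `counterDelta` below is a hypothetical helper sketched for illustration, not code from this PR:

```go
package main

import "fmt"

// counterDelta is a hypothetical helper, not code from this PR: Prometheus
// counters only go up, so a current value below the previous one means the
// counter was reset (for example on an apiserver restart) and the whole
// current value counts as new increments.
func counterDelta(previous, current float64) float64 {
	if current < previous {
		return current
	}
	return current - previous
}

func main() {
	fmt.Println(counterDelta(1, 5)) // normal growth: 4 new errors
	fmt.Println(counterDelta(5, 2)) // reset: 2 new errors, not -3
}
```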
|
/close |
|
@dinhxuanvu: Closed this PR. |
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
/reopen |
|
@dinhxuanvu: Reopened this PR. |
|
/retest-required |
They are back! |
-	Locator: interval.Locator,
-	Message: interval.Message,
+	Locator: html.EscapeString(interval.Locator),
+	Message: html.EscapeString(interval.Message),
Suspect this will break things beyond your PR, and even with your intervals it comes out problematic:
"message": "client observed an API error - previous=1 current=2 rest_client_requests_total{code=\u0026#34;500\u0026#34;, endpoint=\u0026#34;https\u0026#34;, host=\u0026#34;172.30.0.1:443\u0026#34;, instance=\u0026#34;10.129.0.40:8443\u0026#34;, job=\u0026#34;metrics\u0026#34;, method=\u0026#34;GET\u0026#34;, namespace=\u0026#34;openshift-console-operator\u0026#34;, pod=\u0026#34;console-operator-7568df6578-d8zrs\u0026#34;, prometheus=\u0026#34;openshift-monitoring/k8s\u0026#34;, service=\u0026#34;metrics\u0026#34;}",
This ceases to be readable, but there is hope. David and I are in the midst of making intervals more structured; we're only part way through, but the features are available for you to start using, and I think in this case it will avoid the mess above. I'll add details on how in the place where you construct your interval.
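To illustrate why escaping at interval-construction time compounds: Go's encoding/json already escapes &, < and > by default, so a string that has been through html.EscapeString comes out double-encoded, which is exactly the \u0026#34; soup above. A small standalone sketch, not code from the PR:

```go
package main

import (
	"encoding/json"
	"fmt"
	"html"
)

func main() {
	raw := `rest_client_requests_total{code="500"}`

	// Escaping at interval-construction time, as the diff above does: " becomes &#34;
	escaped := html.EscapeString(raw)

	// json.Marshal additionally escapes & as \u0026 by default,
	// producing the hard-to-read \u0026#34; sequences in the serialized interval.
	out, _ := json.Marshal(map[string]string{"message": escaped})
	fmt.Println(string(out))
}
```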
From: current.Timestamp.Time(),
// TODO: find out how long the requests took using data from rest_client_request_duration_seconds_sum?
To: current.Timestamp.Time().Add(time.Second),
}
Per below this is a prime use case for structured intervals:
Currently you're serializing like this:
{
"level": "Error",
"locator": "client/APIError source/service-network node/10.129.0.40 namespace/openshift-console-operator component/openshift-console-operator",
"message": "client observed an API error - previous=1 current=2 rest_client_requests_total{code=\u0026#34;500\u0026#34;, endpoint=\u0026#34;https\u0026#34;, host=\u0026#34;172.30.0.1:443\u0026#34;, instance=\u0026#34;10.129.0.40:8443\u0026#34;, job=\u0026#34;metrics\u0026#34;, method=\u0026#34;GET\u0026#34;, namespace=\u0026#34;openshift-console-operator\u0026#34;, pod=\u0026#34;console-operator-7568df6578-d8zrs\u0026#34;, prometheus=\u0026#34;openshift-monitoring/k8s\u0026#34;, service=\u0026#34;metrics\u0026#34;}",
"tempStructuredLocator": {
"type": "",
"keys": null
},
"tempStructuredMessage": {
"reason": "",
"cause": "",
"humanMessage": "",
"annotations": null
},
"from": "2023-08-07T21:49:25Z",
"to": "2023-08-07T21:49:26Z"
},
Intervals can be created with
monitorapi.NewInterval(monitorapi.SourceRESTClientMonitor, monitorapi.Warning).
Locator(monitorapi.NewLocator().RestAPIError([yourparamsforlocator])).
Message(msg.HumanMessage("client observed an API error")).Display().Build()
You'll want a locator type for your new client API Errors, a locator constructor to create it, and then use message annotations for the previous and current counts.
Is it necessary to have the full PromQL included? If so, maybe put it in its own message annotation, with the escaping done here, but the problem is how do we know when to unescape? Does it serialize okay without the HTML escaping if we put it into a structured field?
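For illustration, roughly how that suggestion could be fleshed out. Everything beyond the chain quoted above is an assumption: the package placement, RestAPIError and its parameters, NewMessage, WithAnnotation, the annotation keys and the Build signature are placeholders for whatever actually gets added, not verified monitorapi API:

```go
package clientview // placement assumed; this is a sketch, not code from the PR

import (
	"strconv"
	"time"

	"github.com/openshift/origin/pkg/monitor/monitorapi"
)

// buildRestClientErrorInterval is hypothetical. The builder chain mirrors the
// suggestion above; the message/locator helpers and annotation keys are assumptions.
func buildRestClientErrorInterval(node, namespace, component string, previous, current int, from, to time.Time) monitorapi.Interval {
	msg := monitorapi.NewMessage(). // assumed message-builder constructor
		HumanMessage("client observed an API error").
		WithAnnotation("previous", strconv.Itoa(previous)). // assumed annotation keys
		WithAnnotation("current", strconv.Itoa(current))

	return monitorapi.NewInterval(monitorapi.SourceRESTClientMonitor, monitorapi.Warning).
		Locator(monitorapi.NewLocator().RestAPIError(node, namespace, component)). // hypothetical locator constructor
		Message(msg).
		Display().
		Build(from, to) // from/to here are an assumption; the quoted chain shows Build()
}
```

Keeping previous/current (and, if needed, the raw PromQL series) in annotations rather than in the human message is what keeps the rendered text readable while preserving the data.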
}
previous := series.Values[0].Value
for _, current := range series.Values[1:] {
	if !previous.Equal(current.Value) {
Am I correct that if we're experiencing client errors over a prolonged period of time, every metric sample that we increment results in its own interval? How far apart are the timestamps on each value? Every 15s?
I'm concerned about bounding here: what if a bad job is going to generate a million intervals? That would break a bunch of things and likely cost us money too.
For example, with disruption we record a single interval for a prolonged period of disruption: we record when we start seeing disruption (and what error message), and then watch for it to stop or for the error message to change. Either event results in the interval now having a To time, so we create it.
IMO you should apply something similar here so that a prolonged uninterrupted period of errors is just one interval. Granted, this may get complicated: maybe we have one value where it doesn't change and then it starts going up again, and that would make a new interval, but that would be better than nothing. Depends on how far apart the Prometheus samples are as well.
> every metric sample that we increment results in its own interval?
No, not exactly. We're going to create one interval for every scrape period (30s) if the number of client errors has increased.
> I'm concerned about bounding here: what if a bad job is going to generate a million intervals?
No, one interval every 30s at most.
Ok, one every 30s, but is that also per time series returned by your PromQL? (It looks like that might catch a fair bit.) For example, if that PromQL returns 500 time series and we have a bad job run with a 30min problem, that's 500 * 30 * 2, or 30k intervals, which would double the number I'd expect normally.
If the error counter for a time series increments at t, t+30s, t+60s, t+90s, I think it's probably worth the effort to make that one interval as we do for disruption. That would involve tracking start points, watching for a sample that didn't increment and using that as a trigger to terminate that interval and add it to the list, and terminating any intervals that were dangling at the end of the job run.
It's per time series, yes, but I don't expect it to be significant. It's "error code" * "source namespace", so in the worst case it's 10 error codes * 50 namespaces * 3 ways to reach the API * (30 min job / 30 sec interval) = 750 intervals for the "every component firing 10 different errors throughout the whole test duration" case.
Unless I'm missing something, 10 error codes x 50 namespaces x 3 API points x 60 (30 minutes at 30s intervals) = 90,000, not 750? I know it's an extreme example, but even 10% of that is a lot more for our JS UI to handle. I think you need to batch based on consecutive failed samples.
Ah, you're correct. Right, this needs interval batching.
Implemented, now sequential intervals are merged into one:
Aug 23 08:39:20.111: INFO: [client-rest-error-serializer] adding new interval Aug 23 08:36:58.784 - 1s E client/APIError source/service-network node/10.128.0.197 namespace/openshift-apiserver component/openshift-apiserver client observed an API error - previous=1 current=5 rest_client_requests_total{apiserver="openshift-apiserver", code="503", container="openshift-apiserver", endpoint="https", host="172.30.0.1:443", instance="10.128.0.197:8443", job="api", method="GET", namespace="openshift-apiserver", pod="apiserver-5785c87bf8-9skb4", prometheus="openshift-monitoring/k8s", service="api"}
...
Aug 23 08:39:20.111: INFO: [client-rest-error-serializer] updated existing interval Aug 23 08:36:58.784 - 30s E client/APIError source/service-network node/10.128.0.197 namespace/openshift-apiserver component/openshift-apiserver client observed an API error - previous=1 current=5 rest_client_requests_total{apiserver="openshift-apiserver", code="503", container="openshift-apiserver", endpoint="https", host="172.30.0.1:443", instance="10.128.0.197:8443", job="api", method="GET", namespace="openshift-apiserver", pod="apiserver-5785c87bf8-9skb4", prometheus="openshift-monitoring/k8s", service="api"}
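For readers following along, a simplified sketch of the batching idea (merge runs of consecutive incrementing samples into one interval). It uses stand-in types rather than the actual monitorapi structures and is not the code from this PR:

```go
package main

import (
	"fmt"
	"time"
)

// sample is a simplified stand-in for one scraped value of a single time series.
type sample struct {
	at    time.Time
	value float64
}

// errorWindow is a simplified stand-in for one merged interval.
type errorWindow struct {
	from, to time.Time
}

// mergeIncrements turns runs of consecutive samples whose counter keeps
// increasing into a single window, instead of one window per scrape.
func mergeIncrements(samples []sample) []errorWindow {
	var windows []errorWindow
	var open *errorWindow

	for i := 1; i < len(samples); i++ {
		increased := samples[i].value > samples[i-1].value
		switch {
		case increased && open == nil:
			// Start a new window at the first sample that shows an increase.
			open = &errorWindow{from: samples[i].at, to: samples[i].at}
		case increased && open != nil:
			// Extend the currently open window.
			open.to = samples[i].at
		case !increased && open != nil:
			// The counter stopped moving: close the window.
			windows = append(windows, *open)
			open = nil
		}
	}
	// Terminate any window still dangling at the end of the run.
	if open != nil {
		windows = append(windows, *open)
	}
	return windows
}

func main() {
	start := time.Now()
	s := []sample{
		{start, 1}, {start.Add(30 * time.Second), 2},
		{start.Add(60 * time.Second), 3}, {start.Add(90 * time.Second), 3},
	}
	fmt.Println(mergeIncrements(s)) // two increments collapse into one window
}
```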
|
|
func (w *alertSummarySerializer) CollectData(ctx context.Context, storageDir string, beginning, end time.Time) (monitorapi.Intervals, []*junitapi.JUnitTestCase, error) {
	intervals, err := fetchEventIntervalsForAllAlerts(ctx, w.adminRESTConfig, beginning)
	clientEventIntervals, err2 := clientview.FetchEventIntervalsForRestClientError(ctx, w.adminRESTConfig, beginning)
These are not conceptually linked to the alert analyzer tests, right? How hard is it to break these out into their own InvariantTest? You just need an adminRESTConfig; it should be pretty easy.
Right, it's best to move this into its own InvariantTest.
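As a rough sketch of what that breakout might look like, mirroring the CollectData signature quoted above; the type name, package placement and import paths are assumptions, and the actual invariant-test interface and registration in origin may differ:

```go
package clientview // placement assumed

import (
	"context"
	"time"

	"k8s.io/client-go/rest"

	"github.com/openshift/origin/pkg/monitor/monitorapi"
	"github.com/openshift/origin/pkg/test/ginkgo/junitapi" // import path assumed
)

// restClientErrorCollector is a hypothetical standalone test that gathers only
// the REST client error intervals, decoupled from the alert analyzer; the real
// invariant-test interface and its registration hooks are not shown here.
type restClientErrorCollector struct {
	adminRESTConfig *rest.Config
}

func (w *restClientErrorCollector) CollectData(ctx context.Context, storageDir string, beginning, end time.Time) (monitorapi.Intervals, []*junitapi.JUnitTestCase, error) {
	intervals, err := FetchEventIntervalsForRestClientError(ctx, w.adminRESTConfig, beginning)
	return intervals, nil, err
}
```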
2eeba1d to
b815593
Compare
b815593 to
27959cb
Compare
|
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: vrutkovs. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing /approve in a comment. |
2ecd7cb to
7640d61
Compare
ffa9a66 to
bc0bc03
Compare
bc0bc03 to
86d7dea
Compare
|
@vrutkovs: The following tests failed, say
Full PR test history. Your PR dashboard. |
|
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle stale. If this issue is safe to close now please do so with /close. /lifecycle stale |
|
PR needs rebase. |
|
Stale issues rot after 30d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle rotten. If this issue is safe to close now please do so with /close. /lifecycle rotten |
|
Rotten issues close after 30d of inactivity. Reopen the issue by commenting /reopen. /close |
|
@openshift-bot: Closed this PR. |
Make sure locator and message are shown as is on HTML page when
rest_increases.
Test bed for #27976 + fixes
TODO:
find out how long the requests took using data from rest_client_request_duration_seconds_sum?
Currently we'll create sequential one-second intervals for each error found, but the exact time and duration of the disruption cannot be properly derived.
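On that TODO, the usual way to estimate request durations from the client-go histogram would be a ratio like the one below (shown as a Go constant for reference); the rate window is an assumption and this is not something the PR implements:

```go
package main

// averageRequestDuration divides the rate of total seconds spent by the rate of
// request counts, giving a per-series average request duration over the window.
const averageRequestDuration = `
  rate(rest_client_request_duration_seconds_sum[5m])
/
  rate(rest_client_request_duration_seconds_count[5m])
`
```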