OCPBUGS-38859: add api-unreachable-from-client monitor test #29003
OCPBUGS-38859: add api-unreachable-from-client monitor test #29003openshift-merge-bot[bot] merged 2 commits intoopenshift:masterfrom
Conversation
|
Job Failure Risk Analysis for sha: 859c24a
|
62869ae to
386882c
Compare
|
Job Failure Risk Analysis for sha: 386882c
|
| query: &metrics.PrometheusQueryRunner{ | ||
| Client: client, | ||
| QueryString: `sum(rate(rest_client_requests_total{code="<error>"}[1m])) by(host)`, | ||
| Step: time.Minute, |
There was a problem hiding this comment.
Is a minute granular enough to do the debugging necessary for problems this will expose? We're normally working in seconds, and IIRC our scape interval is lower than 1 min, 15 or 30s I thought.
There was a problem hiding this comment.
I tried to explain this in the godoc:
The intervals are scraped from metrics, so they don't have the same granularity, as other intervals, since:
a) in OpenShift, metrics are scraped every 30s
b) for rate to be calculated, we need at lease two samples
If an api unreachable interval overlaps with an apiserver shutdown window, it is typically indicative of network issues at the load balancer layer. Since the intervals are grouped by host, we can also narrow it down to a particular host, for example, we have seen cases where connections over internal load balancer to be faulty at times while the service network operated just fine.
Let me know your thoughts.
| endTime := end.Timestamp.Time() | ||
| if start == end { | ||
| // a disruption window with one sample | ||
| endTime = end.Timestamp.Time().Add(time.Minute) |
There was a problem hiding this comment.
Per above Is a minute correct here? More granular would be good. We typically add a second when we're forcing a point in time event to appear in the chart.
There was a problem hiding this comment.
since we scrape every 30s, a 1s granularity is not very meaningful here, i think. I am approximating the interval to [t-30s ... t+30s] for now. Thoughts?
There was a problem hiding this comment.
Ah that makes sense, thanks. Fine as is then.
| }, | ||
| analyzer: metrics.RateSeriesAnalyzer{}, | ||
| builder: &intervalBuilder{ | ||
| serviceNetworkIP: kubeSvc.Spec.ClusterIP, |
There was a problem hiding this comment.
could we use the default kube service instead ? (kubernetes.default.svc)
There was a problem hiding this comment.
not clear what you mean, I believe we are using the cluster IP of kubernetes.default.svc
pkg/monitortests/metrics/metrics.go
Outdated
| return nil, fmt.Errorf("prometheus query %q returned error: %v", q.QueryString, err) | ||
| } | ||
| if len(warnings) > 0 { | ||
| framework.Logf("query %q #### warnings \\n\\t%v\\n\", strings.Join(warningsForQuery, \"\\n\\t\"", q.QueryString, warnings) |
There was a problem hiding this comment.
I think that strings.Join(warningsForQuery should be defined outside of the quotation marks :)
There was a problem hiding this comment.
hehe, thanks for pointing it out, the copy paste escaped the quotes and it has evaded my eyes :)
| } | ||
| to = ¤t | ||
| } | ||
| // is the entire range is a disruption? |
There was a problem hiding this comment.
typo, change to is the entire range a disruption?
| defer callback.EndSeries() | ||
|
|
||
| var from, to *prometheustypes.SamplePair | ||
| for _, current := range series.Values { |
There was a problem hiding this comment.
does the range creates the current var once and then reuses it ?
if yes, then we cannot simply store ¤t as it will point to the same variable.
There was a problem hiding this comment.
I think the golang team fixed it (starting with 1.22) - moved the loop scope variable to iteration scoped - https://go.dev/blog/loopvar-preview. That's why the tests passed. I made it a iteration scoped variable.
| } | ||
|
|
||
| zero := prometheustypes.SampleValue(0) | ||
| matrix := result.(prometheustypes.Matrix) |
There was a problem hiding this comment.
should we check for nil/empty responses ?
why not to do a type assertion ? (not sure if line 19 is enough)
There was a problem hiding this comment.
I think we are fine, we check for nil and then make sure the type is right. prometheustypes.Matrix implements the Value interface
func (Matrix) Type() ValueType { return ValMatrix }
| } | ||
| func (b *intervalBuilder) EndSeries() { b.locator = monitorapi.Locator{} } | ||
|
|
||
| func (b *intervalBuilder) NewDisruptionInterval(metric prometheustypes.Metric, start, end *prometheustypes.SamplePair) { |
There was a problem hiding this comment.
should we check if start and end are not nil ?
same for metric which is a map I think ?
There was a problem hiding this comment.
I don't think it's necessary, the analyzer invokes NewInterval with a non nil start, and end. But if you want i can add nil check
|
@dgoodwin a new screenshot of the intervals in the UI We can see client talking to the kube-apiserver using the internal load balancer is experiencing error and that coincides with apiserver shutdown intervals |
|
@tkashem: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
This all looks ok to me. You'll need a jira in title, I'll approve in case there's more you want to sort out with @p0lyn0mial . /approve |
|
@tkashem: This pull request references Jira Issue OCPBUGS-37862, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@tkashem: This pull request references Jira Issue OCPBUGS-38859, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
Requesting review from QA contact: The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
/retest-required |
|
Job Failure Risk Analysis for sha: 27b70a3
|
|
/lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: dgoodwin, p0lyn0mial, tkashem The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
it needs an |
|
/label acknowledge-critical-fixes-only (it's informing only, does not fail any test, it will help with the load balancer issues we are seeing in CI) |
|
@tkashem: Jira Issue OCPBUGS-38859: All pull requests linked via external trackers have merged: Jira Issue OCPBUGS-38859 has been moved to the MODIFIED state. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
[ART PR BUILD NOTIFIER] Distgit: openshift-enterprise-tests |



This adds a new timeline
api-uneachable, grouped by source:{internal-lb|service-network|external-lb|localhost}. It scrapes therest_client_requests_totalmetricsThe number of timelines in the UI is bound by
source, so it is a limited number.The following shows the disruption intervals where clients observed api errors via
api-intFrom https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/29003/pull-ci-openshift-origin-master-e2e-aws-ovn-kube-apiserver-rollout/1824519484648460288
BTW, looks like the source
webhook.openshift-console-operator.svc:9443shows permanent error, looks to be a bad conversion webhook?Looks like they removed some references of the bad conversion webhook from their crd schema 1. but I still see reference here 2.