Bug 1741645: pkg/cvo/metrics: Report cluster-version conditions with reasons #236

wking · 2019-08-14T20:59:14Z

This will allow us to discover upgrade and other failure reasons without having to resort to a must-gather or similar.

Stick this in cluster_operator_conditions, since we already have a reason slot there. I don't see a reason to add a new metric to separate cluster-version operator failures from second-level operator failures; the name should be sufficient for that.
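A minimal sketch of the idea, not the code in pkg/cvo/metrics.go: each condition becomes one cluster_operator_conditions sample labeled with a name, a condition type, and a reason, with value 1 when the condition is True and 0 otherwise. The `Condition` and `Sample` types, the `conditionSamples` helper, the "version" name, and the "ExampleReason" reason are all assumptions made for illustration, and the sketch uses only the standard library rather than the Prometheus client.

```go
package main

import "fmt"

// Condition is a hypothetical, simplified stand-in for an entry in
// ClusterVersion.Status.Conditions or ClusterOperator.Status.Conditions.
type Condition struct {
	Type   string
	Status string // "True", "False", or "Unknown"
	Reason string
}

// Sample is a hypothetical flattened form of one cluster_operator_conditions
// time series: three labels plus a value.
type Sample struct {
	Name      string
	Condition string
	Reason    string
	Value     float64
}

// conditionSamples emits one sample per condition. The value is 1 when the
// condition is True and 0 otherwise, so inactive conditions are still reported.
func conditionSamples(name string, conditions []Condition) []Sample {
	samples := make([]Sample, 0, len(conditions))
	for _, c := range conditions {
		value := 0.0
		if c.Status == "True" {
			value = 1
		}
		samples = append(samples, Sample{
			Name:      name,
			Condition: c.Type,
			Reason:    c.Reason,
			Value:     value,
		})
	}
	return samples
}

func main() {
	// Assumed example data; "version" as the name label is an assumption.
	conditions := []Condition{
		{Type: "Available", Status: "True"},
		{Type: "Failing", Status: "True", Reason: "ExampleReason"},
		{Type: "Progressing", Status: "False"},
	}
	for _, s := range conditionSamples("version", conditions) {
		fmt.Printf("cluster_operator_conditions{name=%q,condition=%q,reason=%q} %g\n",
			s.Name, s.Condition, s.Reason, s.Value)
	}
}
```

Because the name label distinguishes the cluster-version entry from second-level operators, a query can select one or the other without a separate metric.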

wking · 2019-08-14T21:24:53Z

This may overlap with #232 as a way to get ClusterVersion failure reasons out into Telemetry. Are the parallel tracks (alerts and cluster_operator_conditions metrics) a problem? Do we want to consolidate on alerts and drop cluster_operator_conditions altogether? Or do we want both, either for a belt-and-suspenders approach or because each channel gives us slightly different information (e.g. cluster_operator_conditions may make it easier to get failure rates, because we still push the metrics, with a zero value, when the condition is not active)?
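To illustrate the failure-rate point, here is a rough sketch under stated assumptions: because zero-valued samples are still pushed while a condition is inactive, the total number of scrapes is known, so the share of scrapes with the condition active can be computed directly. The `activeFraction` helper and the sample values are hypothetical and are not taken from this pull request.

```go
package main

import "fmt"

// activeFraction returns the share of samples in which the condition was
// active (value 1). The second result is false when there are no samples,
// in which case no rate can be computed.
func activeFraction(samples []float64) (float64, bool) {
	if len(samples) == 0 {
		return 0, false
	}
	active := 0
	for _, v := range samples {
		if v == 1 {
			active++
		}
	}
	return float64(active) / float64(len(samples)), true
}

func main() {
	// Assumed scrape history for a Failing condition: 0 means inactive,
	// 1 means active.
	failing := []float64{0, 0, 1, 1, 0}
	if fraction, ok := activeFraction(failing); ok {
		fmt.Printf("Failing was active in %.0f%% of scrapes\n", fraction*100)
	}
}
```

With a channel that only reports while a condition is firing, the inactive samples are absent, so the denominator has to come from somewhere else.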

pkg/cvo/metrics.go

abhinavdahiya · 2019-08-14T23:58:22Z

we should also create a bug for this.

wking · 2019-08-15T17:26:25Z

/retitle Bug 1741645: pkg/cvo/metrics: Report cluster-version conditions with reasons

openshift-ci-robot · 2019-08-15T17:26:28Z

@wking: This pull request references an invalid Bugzilla bug:

expected the bug to target the "4.2.0" release, but it targets "---" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

Details

In response to this:

Bug 1741645: pkg/cvo/metrics: Report cluster-version conditions with reasons

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

This will allow us to discover upgrade and other failure reasons without having to resort to a must-gather or similar [1]. And also to look at any other version conditions in Telemetry.

Stick this in cluster_operator_conditions, since we already have a 'reason' slot there. And ClusterVersion.Status.Conditions is pretty much the same thing as ClusterOperator.Status.Conditions; we'll want to see all of those. I don't see a reason to add a new metric to separate cluster-version operator failures from second-level operator failures; the name should be sufficient for that.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1741645

wking · 2019-08-15T17:28:34Z

/bugzilla refresh

openshift-ci-robot · 2019-08-15T17:28:39Z

@wking: This pull request references a valid Bugzilla bug. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

/bugzilla refresh


abhinavdahiya · 2019-08-15T17:49:39Z

/lgtm

openshift-ci-robot · 2019-08-15T17:49:43Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [abhinavdahiya,wking]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

wking · 2019-08-15T18:25:00Z

e2e-aws:

level=error msg="Error: NoSuchBucket: The specified bucket does not exist"
level=error msg="\tstatus code: 404, request id: CE2EC2492682F231, host id: Hzp7AoarEa//ZVR8XEoblAxIFZ2A8AipR3lyGzeCjGxO69/oSHeCNZap3tbF2g/T+mdcd/xmEzo="

Haven't seen that one recently...

/test e2e-aws

wking · 2019-08-15T21:29:59Z

e2e-aws:

Aug 15 20:25:46.283: INFO: Couldn't delete ns: "e2e-svcaccounts-2403": namespace e2e-svcaccounts-2403 was not deleted with limit: timed out waiting for the condition, namespace is empty but is not yet removed (&errors.errorString{s:"namespace e2e-svcaccounts-2403 was not deleted with limit: timed out waiting for the condition, namespace is empty but is not yet removed"})

That's rhbz#1727090.

/test e2e-aws

openshift-ci-robot · 2019-08-15T22:44:29Z

@wking: All pull requests linked via external trackers have merged. The Bugzilla bug has been moved to the MODIFIED state.

Details

In response to this:

Bug 1741645: pkg/cvo/metrics: Report cluster-version conditions with reasons


wking · 2019-08-15T22:44:35Z

/cherrypick release-4.1

openshift-cherrypick-robot · 2019-08-15T22:44:43Z

@wking: new pull request created: #237

Details

In response to this:

/cherrypick release-4.1


openshift-ci-robot requested review from abhinavdahiya and crawford August 14, 2019 20:59

openshift-ci-robot added the approved and size/XS labels Aug 14, 2019

wking force-pushed the metrics-for-cluster-version-failing-reason branch 2 times, most recently from c3c9279 to 4cc5a4d Compare August 14, 2019 21:02

wking force-pushed the metrics-for-cluster-version-failing-reason branch from 4cc5a4d to 9277af9 Compare August 14, 2019 21:29

openshift-ci-robot added the size/S label and removed the size/XS label Aug 14, 2019

wking force-pushed the metrics-for-cluster-version-failing-reason branch from 9277af9 to eb7ff9b Compare August 14, 2019 21:57

openshift-ci-robot added the size/M label and removed the size/S label Aug 14, 2019

wking force-pushed the metrics-for-cluster-version-failing-reason branch from eb7ff9b to b69cd48 Compare August 14, 2019 22:30

abhinavdahiya reviewed Aug 14, 2019

pkg/cvo/metrics.go (outdated, resolved)

abhinavdahiya reviewed Aug 14, 2019

pkg/cvo/metrics.go (outdated, resolved)

wking force-pushed the metrics-for-cluster-version-failing-reason branch 2 times, most recently from d881086 to 240244c Compare August 15, 2019 17:15

openshift-ci-robot changed the title ~~pkg/cvo/metrics: Report cluster-version failing reasons~~ Bug 1741645: pkg/cvo/metrics: Report cluster-version conditions with reasons Aug 15, 2019

openshift-ci-robot added the bugzilla/invalid-bug label Aug 15, 2019

wking force-pushed the metrics-for-cluster-version-failing-reason branch from 240244c to 6861c48 Compare August 15, 2019 17:27

openshift-ci-robot added the bugzilla/valid-bug label and removed the bugzilla/invalid-bug label Aug 15, 2019

openshift-ci-robot assigned abhinavdahiya Aug 15, 2019

openshift-ci-robot added the lgtm label Aug 15, 2019

openshift-merge-robot merged commit e47c778 into openshift:master Aug 15, 2019

wking deleted the metrics-for-cluster-version-failing-reason branch August 15, 2019 22:44

openshift-cherrypick-robot mentioned this pull request Aug 15, 2019

Bug 1741661: pkg/cvo/metrics: Report cluster-version conditions with reasons #237

Merged
