@wking
Member

@wking wking commented Aug 14, 2019

This will allow us to discover upgrade and other failure reasons without having to resort to a must-gather or similar.

Stick this in cluster_operator_conditions, since we already have a reason slot there. I don't see a reason to add a new metric to separate cluster-version operator failures from second-level operator failures; the name should be sufficient for that.
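
For illustration, here is a minimal Go sketch of the idea, not the code in this pull request. It maps status conditions to cluster_operator_conditions samples and carries the reason as a label. The `Condition` and `Sample` types, the `conditionSamples` helper, the label names, the `version` object name, and the `ExampleReason` value are all assumptions made for the example.

```go
package main

import "fmt"

// Condition is a simplified, hypothetical stand-in for the fields shared by
// ClusterVersion and ClusterOperator status conditions.
type Condition struct {
	Type   string
	Status string // "True", "False", or "Unknown"
	Reason string
}

// Sample is one cluster_operator_conditions series together with its value.
type Sample struct {
	Name      string
	Condition string
	Reason    string
	Value     float64
}

// conditionSamples converts the conditions of a single object into samples.
// A condition whose status is "True" is reported as 1; anything else as 0.
func conditionSamples(name string, conds []Condition) []Sample {
	samples := make([]Sample, 0, len(conds))
	for _, c := range conds {
		value := 0.0
		if c.Status == "True" {
			value = 1
		}
		samples = append(samples, Sample{
			Name:      name,
			Condition: c.Type,
			Reason:    c.Reason,
			Value:     value,
		})
	}
	return samples
}

// String renders the sample in a Prometheus-style text form.
// The label names used here are assumptions for this sketch.
func (s Sample) String() string {
	return fmt.Sprintf("cluster_operator_conditions{name=%q,condition=%q,reason=%q} %g",
		s.Name, s.Condition, s.Reason, s.Value)
}

func main() {
	// Hypothetical ClusterVersion conditions; "ExampleReason" is made up.
	conds := []Condition{
		{Type: "Failing", Status: "True", Reason: "ExampleReason"},
		{Type: "Progressing", Status: "False"},
	}
	for _, s := range conditionSamples("version", conds) {
		fmt.Println(s)
	}
}
```

Because the object name travels in the `name` label, the same metric can carry both cluster-version conditions and second-level operator conditions without a separate metric.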

@openshift-ci-robot openshift-ci-robot added the approved (Indicates a PR has been approved by an approver from all required OWNERS files) and size/XS (Denotes a PR that changes 0-9 lines, ignoring generated files) labels Aug 14, 2019
@wking wking force-pushed the metrics-for-cluster-version-failing-reason branch 2 times, most recently from c3c9279 to 4cc5a4d Compare August 14, 2019 21:02
@wking
Member Author

wking commented Aug 14, 2019

This may overlap with #232 as a way to get ClusterVersion failure reasons out into Telemetry. Are the parallel tracks (alerts and cluster_operator_conditions metrics) a problem? Do we want to consolidate on alerts and drop cluster_operator_conditions altogether? Or do we want both, either as a belt-and-suspenders approach or because each channel gives us slightly different information? For example, cluster_operator_conditions may make it easier to get failure rates, because we still push the metrics, with a zero value, when the condition is not active.
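
To make the failure-rate point concrete, here is a small Go sketch under stated assumptions. If a condition is always reported as 0 or 1, the share of scraped samples in which it was active is a simple ratio. The `activeFraction` helper and the sample history are hypothetical and do not come from this repository.

```go
package main

import "fmt"

// activeFraction returns the share of samples in which the condition was
// active (value 1). The boolean is false when there are no samples, which is
// the situation a push-only-when-active channel cannot distinguish from
// "never failed".
func activeFraction(values []float64) (float64, bool) {
	if len(values) == 0 {
		return 0, false
	}
	active := 0
	for _, v := range values {
		if v == 1 {
			active++
		}
	}
	return float64(active) / float64(len(values)), true
}

func main() {
	// Hypothetical scrape history for one condition series.
	history := []float64{0, 0, 1, 1, 0, 0, 0, 0}
	if f, ok := activeFraction(history); ok {
		fmt.Printf("condition active in %.0f%% of samples\n", f*100)
	}
}
```

The zero-valued samples supply the denominator. Without them, only the active periods would be visible.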

@wking wking force-pushed the metrics-for-cluster-version-failing-reason branch from 4cc5a4d to 9277af9 Compare August 14, 2019 21:29
@openshift-ci-robot openshift-ci-robot added the size/S (Denotes a PR that changes 10-29 lines, ignoring generated files) label and removed the size/XS (Denotes a PR that changes 0-9 lines, ignoring generated files) label Aug 14, 2019
@wking wking force-pushed the metrics-for-cluster-version-failing-reason branch from 9277af9 to eb7ff9b Compare August 14, 2019 21:57
@openshift-ci-robot openshift-ci-robot added the size/M (Denotes a PR that changes 30-99 lines, ignoring generated files) label and removed the size/S (Denotes a PR that changes 10-29 lines, ignoring generated files) label Aug 14, 2019
@wking wking force-pushed the metrics-for-cluster-version-failing-reason branch from eb7ff9b to b69cd48 Compare August 14, 2019 22:30
@abhinavdahiya
Contributor

we should also create a bug for this.

@wking wking force-pushed the metrics-for-cluster-version-failing-reason branch 2 times, most recently from d881086 to 240244c Compare August 15, 2019 17:15
@wking
Member Author

wking commented Aug 15, 2019

/retitle Bug 1741645: pkg/cvo/metrics: Report cluster-version conditions with reasons

@openshift-ci-robot openshift-ci-robot changed the title pkg/cvo/metrics: Report cluster-version failing reasons Bug 1741645: pkg/cvo/metrics: Report cluster-version conditions with reasons Aug 15, 2019
@openshift-ci-robot openshift-ci-robot added the bugzilla/invalid-bug (Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting) label Aug 15, 2019
@openshift-ci-robot
Contributor

@wking: This pull request references an invalid Bugzilla bug:

  • expected the bug to target the "4.2.0" release, but it targets "---" instead

Comment /bugzilla refresh to re-evaluate validity if changes to the Bugzilla bug are made, or edit the title of this pull request to link to a different bug.

In response to this:

Bug 1741645: pkg/cvo/metrics: Report cluster-version conditions with reasons

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

This will allow us to discover upgrade and other failure reasons
without having to resort to a must-gather or similar [1].  And also to
look at any other version conditions in Telemetry.

Stick this in cluster_operator_conditions, since we already have a
'reason' slot there.  And ClusterVersion.Status.Conditions is pretty
much the same thing as ClusterOperator.Status.Conditions; we'll want
to see all of those.  I don't see a reason to add a new metric to
separate cluster-version operator failures from second-level operator
failures; the name should be sufficient for that.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1741645
@wking wking force-pushed the metrics-for-cluster-version-failing-reason branch from 240244c to 6861c48 Compare August 15, 2019 17:27
@wking
Member Author

wking commented Aug 15, 2019

/bugzilla refresh

@openshift-ci-robot openshift-ci-robot added the bugzilla/valid-bug (Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting) label and removed the bugzilla/invalid-bug (Indicates that a referenced Bugzilla bug is invalid for the branch this PR is targeting) label Aug 15, 2019
@openshift-ci-robot
Contributor

@wking: This pull request references a valid Bugzilla bug. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

/bugzilla refresh

@abhinavdahiya
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm (Indicates that a PR is ready to be merged) label Aug 15, 2019
@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [abhinavdahiya,wking]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@wking
Member Author

wking commented Aug 15, 2019

e2e-aws:

level=error msg="Error: NoSuchBucket: The specified bucket does not exist"
level=error msg="\tstatus code: 404, request id: CE2EC2492682F231, host id: Hzp7AoarEa//ZVR8XEoblAxIFZ2A8AipR3lyGzeCjGxO69/oSHeCNZap3tbF2g/T+mdcd/xmEzo="

Haven't seen that one recently...

/test e2e-aws

@wking
Member Author

wking commented Aug 15, 2019

e2e-aws:

Aug 15 20:25:46.283: INFO: Couldn't delete ns: "e2e-svcaccounts-2403": namespace e2e-svcaccounts-2403 was not deleted with limit: timed out waiting for the condition, namespace is empty but is not yet removed (&errors.errorString{s:"namespace e2e-svcaccounts-2403 was not deleted with limit: timed out waiting for the condition, namespace is empty but is not yet removed"}) 

That's rhbz#1727090.

/test e2e-aws

@openshift-merge-robot openshift-merge-robot merged commit e47c778 into openshift:master Aug 15, 2019
@openshift-ci-robot
Contributor

@wking: All pull requests linked via external trackers have merged. The Bugzilla bug has been moved to the MODIFIED state.

In response to this:

Bug 1741645: pkg/cvo/metrics: Report cluster-version conditions with reasons

@wking
Member Author

wking commented Aug 15, 2019

/cherrypick release-4.1

@openshift-cherrypick-robot

@wking: new pull request created: #237

In response to this:

/cherrypick release-4.1

Labels

approved (Indicates a PR has been approved by an approver from all required OWNERS files), bugzilla/valid-bug (Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting), lgtm (Indicates that a PR is ready to be merged), size/M (Denotes a PR that changes 30-99 lines, ignoring generated files)

5 participants