Skip to content

Conversation

@petr-muller
Copy link
Member

@petr-muller petr-muller commented Jun 3, 2024

Replaces automated #1771 because multiple PRs were needed for OCPBUGS-33896.

Backports #1771 #1782 and #1787 to release 4.16:

  • add alerts to update health in oc adm upgrade status
  • add mock tests for alerts in oc adm upgrade status
  • OCPBUGS-33896: status/inspect-alerts: handle non-200 by Thanos
  • inspectalerts: use client-go wrappers for thanos call debugging
  • inspectalerts: refactor getWithBearer to try all urls in route
  • upgrade status: polish alert insights

PratikMahajan and others added 6 commits June 3, 2024 12:02
this commit adds alerts that fire during the upgrade to
`upgrade health` section.

by default all the alerts that started firing after intiating the
upgrade will appear in the upgrade health section

we also have allowed alerts that will show alerts that firing before
the upgrade was started.

We've not added examples for alerts and are just testing it with
unit tests. This is to reduce the input data in the examples.
Previously, the method iterated over URIs from a route but instead of searching for success it actually searched until first failures, which is against the point of iterating over possible URIs in the first place.

Refactor the method so that it does not return on error, and return on success instead. Only return with an error if all URIs failed to yield a workable result. Slightly optimize the error for the common case where there is only a single URI to try, and shorten the string by using a namespace/name nnotation.
- skip alerts without required labels
- add context on why we show the insight (started firing during update or is known to affect updates)
- skip alerts with info level
- explicitly mention the alert does not have a runbook
- fix `shortDuration` for more cases, including `now`, add tests
- handle also `message` annotation on alerts
@openshift-ci-robot openshift-ci-robot added jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Jun 3, 2024
@openshift-ci-robot
Copy link

@petr-muller: This pull request references Jira Issue OCPBUGS-33917, which is valid.

7 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)
  • release note type set to "Release Note Not Required"
  • dependent bug Jira Issue OCPBUGS-33896 is in the state Verified, which is one of the valid states (MODIFIED, ON_QA, VERIFIED)
  • dependent Jira Issue OCPBUGS-33896 targets the "4.17.0" version, which is one of the valid target versions: 4.17.0
  • bug has dependents

Requesting review from QA contact:
/cc @evakhoni

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

Backports #1771 #1782 and #1787 to release 4.16:

  • add alerts to update health in oc adm upgrade status
  • add mock tests for alerts in oc adm upgrade status
  • OCPBUGS-33896: status/inspect-alerts: handle non-200 by Thanos
  • inspectalerts: use client-go wrappers for thanos call debugging
  • inspectalerts: refactor getWithBearer to try all urls in route
  • upgrade status: polish alert insights

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jun 3, 2024

@petr-muller: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-metal-ipi-ovn-ipv6 3955f41 link false /test e2e-metal-ipi-ovn-ipv6

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@petr-muller
Copy link
Member Author

/cc @PratikMahajan @wking @jan--f
/uncc @deads2k

@openshift-ci openshift-ci bot requested review from PratikMahajan, jan--f and wking and removed request for deads2k June 3, 2024 15:07
Copy link
Member

@wking wking left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clean backports; nothing but some context difference vs. the dev-branch changes:

$ cherry-pick-diff origin/release-4.16..origin/pr/1794 origin/master

7515161 -> 3955f41 upgrade status: polish alert insights
51487f9 -> b2c1189 inspectalerts: refactor getWithBearer to try all urls in route
2715cc4 -> 579e757 inspectalerts: use client-go wrappers for thanos call debugging
518c0e7 -> c3cee5d OCPBUGS-33896: status/inspect-alerts: handle non-200 by Thanos
d5f52ff -> 48c7595 add mock tests for alerts in oc adm upgrade status
10c41f5 -> 8b0dc50 add alerts to update health in oc adm upgrade status

751516199 -> 3955f41d7 `upgrade status`: polish alert insights
--- 751516199
+++ 3955f41d7
@@ -11,3 +11,3 @@
 diff --git a/pkg/cli/admin/upgrade/status/alerts.go b/pkg/cli/admin/upgrade/status/alerts.go
-index cf66178d9..a9387b865 100644
+index 503c4ebf9..bc690baec 100644
 --- a/pkg/cli/admin/upgrade/status/alerts.go
@@ -23,3 +23,3 @@
 @@ -57,27 +58,72 @@ func parseAlertDataToInsights(alertData AlertData, startedAt time.Time) []update
-       var updateInsights []updateInsight
+       var updateInsights []updateInsight = []updateInsight{}
  
@@ -102,3 +102,3 @@
 diff --git a/pkg/cli/admin/upgrade/status/alerts_test.go b/pkg/cli/admin/upgrade/status/alerts_test.go
-index db68ab2a0..654976165 100644
+index bb79a3e8b..be25cd932 100644
 --- a/pkg/cli/admin/upgrade/status/alerts_test.go

d5f52ff69 -> 48c75952f add mock tests for alerts in oc adm upgrade status
--- d5f52ff69
+++ 48c75952f
@@ -269,6 +269,6 @@
 diff --git a/pkg/cli/admin/upgrade/status/examples/4.15.0-ec2-unavailable-mco-20m.detailed-output b/pkg/cli/admin/upgrade/status/examples/4.15.0-ec2-unavailable-mco-20m.detailed-output
-index 6225bf52a..a7381ef27 100644
+index 6c3c5bcc0..1c6d86694 100644
 --- a/pkg/cli/admin/upgrade/status/examples/4.15.0-ec2-unavailable-mco-20m.detailed-output
 +++ b/pkg/cli/admin/upgrade/status/examples/4.15.0-ec2-unavailable-mco-20m.detailed-output
-@@ -33,3 +33,21 @@ Message: Cluster Operator machine-config is unavailable (MachineConfigController
+@@ -34,3 +34,21 @@ Message: Cluster Operator machine-config is unavailable (MachineConfigController
    Resources:
@@ -295,6 +295,6 @@
 diff --git a/pkg/cli/admin/upgrade/status/examples/4.15.0-ec2-unavailable-mco-20m.output b/pkg/cli/admin/upgrade/status/examples/4.15.0-ec2-unavailable-mco-20m.output
-index 5289e9e55..c40a74b8c 100644
+index 155bfa393..f230be862 100644
 --- a/pkg/cli/admin/upgrade/status/examples/4.15.0-ec2-unavailable-mco-20m.output
 +++ b/pkg/cli/admin/upgrade/status/examples/4.15.0-ec2-unavailable-mco-20m.output
-@@ -25,7 +25,9 @@ ip-10-0-4-159.us-east-2.compute.internal    Outdated     Pending   4.14.0-rc.3
+@@ -26,7 +26,9 @@ ip-10-0-4-159.us-east-2.compute.internal    Outdated     Pending   4.14.0-rc.3
  ip-10-0-99-40.us-east-2.compute.internal    Outdated     Pending   4.14.0-rc.3   ?     
@@ -311,3 +311,3 @@
 diff --git a/pkg/cli/admin/upgrade/status/status.go b/pkg/cli/admin/upgrade/status/status.go
-index ad24f7e14..d79113d0f 100644
+index b54e530b3..a56e60459 100644
 --- a/pkg/cli/admin/upgrade/status/status.go
@@ -322,3 +322,3 @@
  
-@@ -283,7 +284,6 @@ func (o *options) Run(ctx context.Context) error {
+@@ -280,7 +281,6 @@ func (o *options) Run(ctx context.Context) error {
                if err := json.Unmarshal(alertBytes, &alertData); err != nil {

10c41f526 -> 8b0dc5057 add alerts to update health in oc adm upgrade status
--- 10c41f526
+++ 8b0dc5057
@@ -277,3 +277,3 @@
 diff --git a/pkg/cli/admin/upgrade/status/health.go b/pkg/cli/admin/upgrade/status/health.go
-index ef047b0c4..07e895352 100644
+index 8d77250b6..24c23b5e4 100644
 --- a/pkg/cli/admin/upgrade/status/health.go
@@ -309,3 +309,3 @@
 diff --git a/pkg/cli/admin/upgrade/status/status.go b/pkg/cli/admin/upgrade/status/status.go
-index 0612a0745..ad24f7e14 100644
+index 82d0de1c4..b54e530b3 100644
 --- a/pkg/cli/admin/upgrade/status/status.go
@@ -371,3 +371,3 @@
                if err != nil {
-@@ -245,6 +266,27 @@ func (o *options) Run(ctx context.Context) error {
+@@ -242,6 +263,27 @@ func (o *options) Run(ctx context.Context) error {
        }

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jun 3, 2024
@evakhoni
Copy link

evakhoni commented Jun 3, 2024

pre-merge verified
/label qe-approved

@openshift-ci openshift-ci bot added the qe-approved Signifies that QE has signed off on this PR label Jun 3, 2024
@petr-muller
Copy link
Member Author

@evakhoni can you slap the cherry-pick-approved here too?

@evakhoni
Copy link

evakhoni commented Jun 3, 2024

np.
/label cherry-pick-approved

@openshift-ci openshift-ci bot added the cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. label Jun 3, 2024
@soltysh
Copy link
Contributor

soltysh commented Jun 4, 2024

/label backport-risk-assessed

@openshift-ci openshift-ci bot added the backport-risk-assessed Indicates a PR to a release branch has been evaluated and considered safe to accept. label Jun 4, 2024
Copy link

@simonpasquier simonpasquier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@simonpasquier
Copy link

/approve

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jun 4, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: petr-muller, simonpasquier, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 4, 2024
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jun 4, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: petr-muller, simonpasquier, wking

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit d9c3227 into openshift:release-4.16 Jun 4, 2024
@openshift-ci-robot
Copy link

@petr-muller: Jira Issue OCPBUGS-33917: All pull requests linked via external trackers have merged:

Jira Issue OCPBUGS-33917 has been moved to the MODIFIED state.

Details

In response to this:

Replaces automated #1771 because multiple PRs were needed for OCPBUGS-33896.

Backports #1771 #1782 and #1787 to release 4.16:

  • add alerts to update health in oc adm upgrade status
  • add mock tests for alerts in oc adm upgrade status
  • OCPBUGS-33896: status/inspect-alerts: handle non-200 by Thanos
  • inspectalerts: use client-go wrappers for thanos call debugging
  • inspectalerts: refactor getWithBearer to try all urls in route
  • upgrade status: polish alert insights

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@petr-muller petr-muller deleted the cherry-pick-1740-1787-1782-to-release-4.16 branch June 4, 2024 18:21
@openshift-bot
Copy link
Contributor

[ART PR BUILD NOTIFIER]

This PR has been included in build ose-tools-container-v4.16.0-202406042036.p0.gd9c3227.assembly.stream.el9 for distgit ose-tools.
All builds following this will include this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. backport-risk-assessed Indicates a PR to a release branch has been evaluated and considered safe to accept. cherry-pick-approved Indicates a cherry-pick PR into a release branch has been approved by the release branch manager. jira/severity-important Referenced Jira bug's severity is important for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged. qe-approved Signifies that QE has signed off on this PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.