@smarterclayton
Contributor

smarterclayton commented Jan 6, 2021

With the recent increase in cluster metrics, some disruptive tests
can trigger errors that result in a burst of
cluster_operator_conditions or alerts series that then clear after
the disruption. We want to run the full suite after we run a
disruption, and in general we are concerned with average over max,
so shorten the interval we check to 1h and calculate the average.

When looking at telemetry from 4.7 CI clusters, the disruptive tests
briefly peak at 600 series and then fall to 300 almost immediately
after. Using the average, the total count is closer to 400 over the
hour the tests run, which better represents the goal of the test
(to limit average load, not spikes). The maximum is still checked,
against a limit of double the average limit.
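
To illustrate the reasoning above, here is a minimal sketch of an average-based check with a looser cap on the maximum. It is not the code in this PR: the function name `check_series_budget`, the limit of 500, and the sample values are all hypothetical, and it assumes "double the average" means the maximum is capped at twice the average limit.

```python
def check_series_budget(samples, avg_limit):
    """Check a window of series counts against an average limit.

    Assumption (not confirmed by the PR text): the maximum is allowed
    to reach up to double the average limit before the check fails.
    Returns a tuple (ok, avg, peak).
    """
    if not samples:
        raise ValueError("at least one sample is required")
    avg = sum(samples) / len(samples)
    peak = max(samples)
    ok = avg < avg_limit and peak < 2 * avg_limit
    return ok, avg, peak


# Hypothetical one-hour window, one sample per minute: the count sits
# at 300 and spikes to 600 for ten minutes during a disruption.
samples = [300] * 50 + [600] * 10
ok, avg, peak = check_series_budget(samples, avg_limit=500)
print(ok, avg, peak)
```

With these made-up numbers, a check on the maximum alone against 500 would fail because of the spike, while the average-based check passes and still rejects a window whose peak reaches double the limit.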

Resolves failures encountered when attempting to run the disruptive
suite (destroy the cluster and recover) and then the conformance
suite. A subsequent PR will remove the skip on disruptive.

@marun, @lilic

@openshift-ci-robot added the approved label (indicates a PR has been approved by an approver from all required OWNERS files) Jan 6, 2021
@marun
Contributor

marun commented Jan 6, 2021

/lgtm

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: marun, smarterclayton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot added the lgtm label (indicates that a PR is ready to be merged) Jan 6, 2021
@openshift-ci
Contributor

openshift-ci bot commented Jan 6, 2021

@smarterclayton: The following test failed, say /retest to rerun all failed tests:

Test name | Commit | Details | Rerun command
--- | --- | --- | ---
ci/prow/e2e-agnostic-cmd | 25a026e | link | /test e2e-agnostic-cmd

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-bot
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot merged commit 8ca3f31 into openshift:master Jan 7, 2021