
Conversation

@smarterclayton (Contributor) commented Jan 6, 2021

A number of cluster invariants are checked without time boundaries - for
instance, that kube-apiserver hasn't failed to gracefully restart according
to all events stored in the API, or that Prometheus has not reported any
alerts since the start of the test. However, when these tests run after
induced disruption (like recovering a master or a restored cluster), they
fail because they observe the disruption. The current behavior is useful for
assessing install, but can't scale to disruption events.

Instead of tests hardcoding arbitrary intervals, standardize the lookback
window into a pair of utility functions: one for the complete valid range,
and one for a more limited "look back a reasonable period of time". Allow
the test invoker to pass an environment variable
TEST_LIMIT_START_TIME=<unix_timestamp_in_seconds_since_epoch> that
automatically constrains how far back tests look. This allows the tests to
continue to observe installation failures by default, and lets disruption
suites pass a timestamp taken after the disruption to the test suite to
limit how far the lookback extends.
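
A minimal sketch, assuming Go, of what such a pair of helpers could look like given the TEST_LIMIT_START_TIME contract described above; the package and function names are illustrative assumptions, not the exact code in this change:

```go
package monitortest // hypothetical package name, for illustration only

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// limitStartTime returns the optional lower bound supplied by the test
// invoker via TEST_LIMIT_START_TIME (unix seconds since epoch), or the
// zero time when the variable is unset.
func limitStartTime() (time.Time, error) {
	v := os.Getenv("TEST_LIMIT_START_TIME")
	if len(v) == 0 {
		return time.Time{}, nil
	}
	sec, err := strconv.ParseInt(v, 10, 64)
	if err != nil {
		return time.Time{}, fmt.Errorf("TEST_LIMIT_START_TIME must be unix seconds: %v", err)
	}
	return time.Unix(sec, 0), nil
}

// fullTestWindow is the "complete valid range" helper: everything since the
// suite started, unless the invoker provided a later limit start time.
func fullTestWindow(suiteStart time.Time) time.Duration {
	if limit, err := limitStartTime(); err == nil && !limit.IsZero() && limit.After(suiteStart) {
		return time.Since(limit)
	}
	return time.Since(suiteStart)
}

// reasonableWindow is the "look back a reasonable period of time" helper: a
// default window (1h in this change) that the limit start time can clamp or
// extend.
func reasonableWindow(defaultWindow time.Duration) time.Duration {
	if limit, err := limitStartTime(); err == nil && !limit.IsZero() {
		return time.Since(limit)
	}
	return defaultWindow
}
```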

Update all arbitrary prometheus range queries to use the "reasonable"
window (1h by default) that can be clamped or extended by the start time.
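
As a hypothetical usage example building on the sketch above (same imports; the alert expression and helper name are assumptions, not code from this PR), a check can interpolate the bounded window into its PromQL instead of hardcoding an interval:

```go
// buildCriticalAlertQuery renders a PromQL expression that only looks back
// over the bounded window rather than the entire test run. Illustrative only.
func buildCriticalAlertQuery() string {
	window := reasonableWindow(time.Hour)
	return fmt.Sprintf(
		`count_over_time(ALERTS{alertstate="firing",severity="critical"}[%ds])`,
		int(window.Seconds()),
	)
}
```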

@marun, @sttts

Includes #25783 because I also changed that code and then wanted to parameterize it here.

With the recent increase in cluster metrics, some disruptive tests
can trigger errors that result in a burst of
cluster_operator_conditions or alerts series that then clear after
the disruption. We want to run the full suite after we run a
disruption, and in general we care about the average rather than the
maximum, so shorten the interval we check to 1h and calculate the average.

When looking at telemetry from 4.7 CI clusters, the disruptive tests
briefly peak at 600 series and then fall back to 300 almost immediately
after. Using the average, the total count is closer to 400 over the
hour the tests run, which better represents the goal of the test
(to limit average load, not spikes). Check the maximum at double the
average.
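
As an illustration of that average-versus-maximum split (the metric name comes from the commit message above, but the budgets and exact expressions are assumptions rather than the queries added in this PR), the check could be phrased as two PromQL conditions over the 1h window:

```go
// seriesBudgetQueries returns example expressions that enforce an average
// series budget over the last hour while tolerating brief spikes up to
// double that budget. Illustrative only; builds on the earlier sketch.
func seriesBudgetQueries(avgBudget int) []string {
	return []string{
		fmt.Sprintf(`avg_over_time(count(cluster_operator_conditions)[1h:1m]) < %d`, avgBudget),
		fmt.Sprintf(`max_over_time(count(cluster_operator_conditions)[1h:1m]) < %d`, avgBudget*2),
	}
}
```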
@openshift-ci-robot added the approved label Jan 6, 2021
@marun (Contributor) commented Jan 6, 2021

/lgtm

@openshift-ci-robot added the lgtm label Jan 6, 2021
@openshift-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: marun, smarterclayton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@smarterclayton (Contributor, Author) commented:

/retest

openshift-ci bot commented Jan 7, 2021

@smarterclayton: The following test failed, say /retest to rerun all failed tests:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/e2e-agnostic-cmd | ace1345 | link | /test e2e-agnostic-cmd |

Full PR test history. Your PR dashboard.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-bot (Contributor) commented:

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot merged commit ad9e07a into openshift:master Jan 7, 2021
wking added a commit to wking/origin that referenced this pull request Feb 22, 2021
We're doing better in updates now, and want to ratchet down to bar
critical-alert noise during updates.  The old 1m
alertPeriodCheckMinutes landed with this test in 3b8cb3c (Add CI
test to check for crit alerts post upgrade, 2020-03-27, openshift#24786).

DurationSinceStartInSeconds, which I'm using now, landed in ace1345
(test: Allow tests that check invariants over time to be constrained,
2021-01-06, openshift#25784).