test: Allow tests that check invariants over time to be constrained #25784
With the recent increase in cluster metrics, some disruptive tests can trigger errors that result in a burst of cluster_operator_conditions or alerts series that clear after the disruption. We want to run the full suite after we run a disruption, and in general we care about the average more than the maximum, so shorten the interval we check to 1h and calculate the average. In telemetry from 4.7 CI clusters, the disruptive tests briefly peak at 600 series and then fall to 300 almost immediately. Using the average, the total count is closer to 400 over the hour the tests run, which better represents the goal of the test (to limit average load, not spikes). Check the maximum as double the average.
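The average-versus-peak check described above could be sketched roughly as follows. This is an illustration only, not the code from this PR: `checkSeriesLoad` and its parameters are hypothetical names, and reading "check the maximum as double the average" as "the peak may not exceed twice the average limit" is an assumption.

```go
package main

import "fmt"

// checkSeriesLoad is a hypothetical helper (not from the PR). It takes the
// series counts sampled over the checked window (for example 1h) and a limit
// on the average count. It fails if the average exceeds avgLimit, or if the
// peak exceeds 2*avgLimit (assumed reading of "maximum as double the average").
func checkSeriesLoad(samples []float64, avgLimit float64) error {
	if len(samples) == 0 {
		return fmt.Errorf("no samples in window")
	}
	var sum, peak float64
	for _, s := range samples {
		sum += s
		if s > peak {
			peak = s
		}
	}
	avg := sum / float64(len(samples))
	if avg > avgLimit {
		return fmt.Errorf("average series count %.0f exceeds limit %.0f", avg, avgLimit)
	}
	if peak > 2*avgLimit {
		return fmt.Errorf("peak series count %.0f exceeds limit %.0f", peak, 2*avgLimit)
	}
	return nil
}
```

With samples shaped like the telemetry above (a brief peak at 600 that falls back to 300, averaging about 400), a limit on the average tolerates the spike while still bounding sustained load.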
A number of cluster invariants check without time boundaries - for instance, that kube-apiserver hasn't failed to gracefully restart according to all events stored in the API, or that prometheus has not reported any alerts since the start of the test. However, when these tests run after induced disruption (like recovering a master or a restored cluster) the test fails because it sees the disruption. The current behavior is useful for assessing install, but can't scale to disruption events.

Instead of tests hardcoding arbitrary intervals, standardize the lookback window into a pair of utility functions, one for the complete valid range, and one for a more limited "look back a reasonable period of time". Allow the test invoker to pass an environment variable TEST_LIMIT_START_TIME=<unix_timestamp_in_seconds_since_epoch> that will automatically constrain how far back tests look. This allows the tests to continue to observe installation failures by default, and lets disruption suites pass the time after the disruption to the test suite to limit how far the lookback extends.

Update all arbitrary prometheus range queries to use the "reasonable" window (1h by default) that can be clamped or extended by the start time.
/lgtm

[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: marun, smarterclayton
We're doing better in updates now, and want to ratchet down to bar critical-alert noise during updates. The old 1m alertPeriodCheckMinutes landed with this test in 3b8cb3c (Add CI test to check for crit alerts post upgrade, 2020-03-27, openshift#24786). DurationSinceStartInSeconds, which I'm using now, landed in ace1345 (test: Allow tests that check invariants over time to be constrained, 2021-01-06, openshift#25784).
@marun, @sttts
Includes #25783 because I also changed that code and then wanted to parameterize it here.