test: Assess average series rather than max over the test window #25783
With the recent increase in cluster metrics, some disruptive tests
can trigger errors that result in a burst of
cluster_operator_conditions or alerts series that then clear after
the disruption. We want to run the full suite after we run a
disruption, and in general we care more about the average than the
max, so shorten the interval we check to 1h and calculate the average.
When looking at telemetry from 4.7 CI clusters, the disruptive tests
briefly peak at 600 series and then fall to 300 almost immediately
after. Using the average, the series count is closer to 400 over the
hour the tests run, which better represents the goal of the test
(to limit average load, not spikes). Also check the maximum, with a
limit of double the average limit.
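The check described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `seriesStats`, `summarize`, `checkSeries`, and the `averageLimit` parameter are hypothetical names, and the samples stand in for series counts collected over the 1h window.

```go
package main

import "fmt"

// seriesStats summarizes sampled series counts over the test window.
// (Hypothetical type, for illustration only.)
type seriesStats struct {
	Average float64
	Max     float64
}

// summarize computes the average and the peak of the sampled series counts.
func summarize(samples []float64) (seriesStats, error) {
	if len(samples) == 0 {
		return seriesStats{}, fmt.Errorf("no samples in the test window")
	}
	var sum float64
	peak := samples[0]
	for _, v := range samples {
		sum += v
		if v > peak {
			peak = v
		}
	}
	return seriesStats{Average: sum / float64(len(samples)), Max: peak}, nil
}

// checkSeries returns an error if the average exceeds averageLimit, or if
// the peak exceeds twice averageLimit (the assumed meaning of "double").
func checkSeries(samples []float64, averageLimit float64) error {
	stats, err := summarize(samples)
	if err != nil {
		return err
	}
	if stats.Average > averageLimit {
		return fmt.Errorf("average series count %.0f exceeds limit %.0f", stats.Average, averageLimit)
	}
	if maxLimit := 2 * averageLimit; stats.Max > maxLimit {
		return fmt.Errorf("max series count %.0f exceeds limit %.0f", stats.Max, maxLimit)
	}
	return nil
}
```

With samples shaped like the telemetry above (a short burst at 600, then 300 for the rest of the hour), for example `[]float64{600, 600, 300, 300, 300, 300}`, the average is 400. A hypothetical `averageLimit` of 450 would pass even though the peak is 600, while a sustained load above the limit would still fail.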
Resolves failures encountered when attempting to run the disruptive
suite (destroy the cluster and recover) and then the conformance
suite. A subsequent PR will remove the skip on disruptive.
@marun, @lilic