[OCPCLOUD-910] Add HistogramVector to track transition into different Machine phases#640
Conversation
|
Skipping CI for Draft Pull Request. |
a5916c5 to
9e5a3e9
Compare
|
i'm +1 for this metric, i think it gives an interesting window into machine creation timings. i would love if we could add a label to track each machine as well, but i have a feel the unbounded cardinality would not be appreciated. |
|
/retest |
|
/retest |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: enxebre The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
@JoelSpeed This is a big step forward for us to narrow down our test failures. Big 👍 from me. |
|
/hold cancel |
elmiko
left a comment
There was a problem hiding this comment.
i still like this, holding back lgtm to let others get a look.
|
/test e2e-azure |
|
@JoelSpeed: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
|
|
||
| // Update the metric after everything else has succeeded to prevent duplicate | ||
| // entries when there are failures | ||
| if phase != phaseDeleting { |
There was a problem hiding this comment.
We should protect against duplicate calls to the same phase so we don't spoil this metric.
There was a problem hiding this comment.
I had considered this and thought it to be covered by L#432, we only update the metric on changes to the phase right? Do you think there's some extra nuance or case that needs to be covered here?
There was a problem hiding this comment.
Nope, I missed that line. We should probably refactor this function to quit early instead of putting everything in this giant if statement.
One thing to consider is protecting against empty phase.
There was a problem hiding this comment.
Nope, I missed that line. We should probably refactor this function to quit early instead of putting everything in this giant if statement.
Ack, that would be sensible!
One thing to consider is protecting against empty phase.
How do you mean here, as in if it were empty then skipped ahead to eg Running? Wondering when this could happen, only when the status is blanked somehow? Maybe on creation of master machines during IPI? How would you suggest we protect here? Should we only record "" to "provisioning" and ignore "" to anything else?
There was a problem hiding this comment.
Yeah, probably outside the scope of this patch set. Perhaps we need a validatePhaseTransition() function to ensure we're doing the right thing here.
I'm confident the code will work as-is today, I'm worried about protecting against bugs in the future. We can probably just make another card for this, I think we can ship as-is.
|
/retest |
|
/retest Please review the full test history for this PR and help us cut down flakes. |
1 similar comment
|
/retest Please review the full test history for this PR and help us cut down flakes. |
This PR adds a metric which tracks the phase transitions for Machines as they are transitioning to
Provisioning,Provisioned,RunningorFailed. This will allow us to calculate, for example, what the average creation time for a Machine is and see which phases are particularly slow.Eg, the average time it took for a machine to enter each of these phases:
Since this is a histogram, we should be able to create some interesting graphs in Grafana once this is merged to allow customers to track how long different phases are taking
Potential future work: If we could get the MCS to serve the same metric and set an imaginary phase for
IgnitionFetched, we could also see how long the Machine took to get to fetching ignition config.