Re-enable metrics, add metrics for vSphere #590
openshift-merge-robot merged 1 commit into openshift:master
Conversation
Skipping CI for Draft Pull Request.
Force-pushed c4165e0 to 9e1f8af
JoelSpeed left a comment:
Added some comments as I read through; let me know if you have any questions about them.
```go
prometheus.CounterOpts{
	Name: "failed_machine_sets_total",
	Help: "Number of times machine set provisioning has failed.",
}, []string{"name", "namespace", "timestamp", "reason"},
```
I'm not sure we should include the timestamp; does it add any extra information for the user, do you think?

Doesn't look like it, unless we want an additional unique key for tracking successes or failures of identically named machine sets, which sounds like too much.

Prometheus will allow you to see this anyway, as it will track the value over time. So you can use a query to work out when the value changed and get a spike on the graph when it was incremented. Good thought though! Definitely the right line of thinking.
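To make the point concrete, recovering the timing from Prometheus itself can be done with a range-vector query. Assuming the `failed_machine_sets_total` counter from the snippet in this thread, something like the following shows spikes whenever failures occur, with no timestamp label needed (the 5-minute window and the namespace value are illustrative choices, not from this PR):

```promql
# Failures observed per 5-minute window; non-zero points mark when the counter moved.
increase(failed_machine_sets_total{namespace="openshift-machine-api"}[5m])
```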
```go
prometheus.CounterOpts{
	Name: "succeeded_machine_sets_total",
	Help: "Number of times machine set provisioning has succeeded.",
}, []string{"name", "namespace", "timestamp"},
```
Similarly, I don't think the timestamp adds anything?
```go
}, []string{"name", "namespace", "timestamp", "reason"},
)

// SucceededMachineSetProvisionCount calculates the number of success provisioning for a machine
```

Suggested change:
```diff
-// SucceededMachineSetProvisionCount calculates the number of success provisioning for a machine
+// SucceededMachineSetProvisionCount calculates the number of success provisioning for a MachineSet
```
Force-pushed 9e1f8af to 077974a
Force-pushed 077974a to 5df6dd4
/retest

1 similar comment

/retest
Force-pushed 5df6dd4 to ffe4a58
Force-pushed 8ad2615 to d85718a
JoelSpeed left a comment:
I realise there's some tidy-up to do; I've added comments where there is, so we don't forget them before merging. Added a couple of suggestions, but otherwise this is looking pretty good.
Force-pushed d85718a to 0a0b736
Force-pushed 9430cb2 to 37c97ca
/approve
Force-pushed 561a9a9 to 8bb5059
```yaml
- name: machineset-mtrc
  targetPort: machineset-mtrc
  port: 8442
- name: nodelink-mtrc
```
Do we think there is any value we can get out of exposing metrics for the nodelink controller? I'd expect us to eventually drop it and embed that knowledge into the machine controller.

I think that dropping those metrics and the nodelink removal could be done at the same time, WDYT?

I can't see any value in conflating that here. If you agree there's not much value in exposing the nodelink metrics, let's just undo the changes that expose the controller, so we alleviate the operational complexity introduced by this PR. Dropping the nodelink controller is a completely different discussion that does not need to happen in this PR.
```go
}

const (
	VSphereProvider string = "vsphere"
```
Why do we need this? The platform is implicit in any running cluster.

I was assuming the failure rates will be collected across clusters and aggregated by that value in the longer term?

These metrics are only exposed at each individual cluster's scope. If anything were to aggregate them in the future, we would consider introducing any further suitable labels by then.

Would we not want to have this label if we were bringing telemetry back to RH about our controllers?

That's my idea too. It may be a better idea to embed the provider-name resolution into MAO, so it would identify the provider from the infrastructure config and set it before sending metrics, @enxebre @JoelSpeed?

> Would we not want to have this label if we were bringing telemetry back to RH about our controllers?

If we do, we can discuss it when there's a proposal for exposing machine API metrics for telemetry. As of today this property does not provide any value for the consumers of these metrics, i.e. the users owning the cluster. Most of the information around machine usage is already exposed as metrics by MAO: you can get lists of failed machines, and we could possibly group by error message. This is complementary.
```yaml
    k8s-app: machine-api-operator
  sessionAffinity: None
---
apiVersion: v1
```
Where is the serviceMonitor that goes with this?

It can't be merged until the provider integration PRs get into the repos, or the CI would fail as the machine metrics won't be accessible. The serviceMonitor is therefore in #609.

> or the CI would fail as the machine metrics won't be accessible

Can you elaborate? Can you point me to the origin test that will fail?

Have a look at the `Alerts shouldn't report any alerts in firing state...` test in every provider; sample: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/590/pull-ci-openshift-machine-api-operator-master-e2e-aws/2496

It's not the service which is triggering it, but the serviceMonitor from the second PR, #609. I included the merge order in the Jira issue and posted it in Slack too. While the provider PRs are not merged, the metrics port is not served from the code, and while Prometheus is trying to connect, the machine metrics respond with 502. That results in an alert in the openshift-monitoring namespace. Joel and I already discussed the issue in Slack and decided to split the PR instead of revendoring the MAO PR branch in every provider.

I thought Prometheus was not triggering an alert for that scenario. This would need to account for OpenStack and baremetal. There's nothing stopping us from including the serviceMonitor for machineset and mhc in this PR and watching it work, right?

I've opened issues in both repos to make sure this won't be left unanswered:
- Enable metrics to support updated MAO metrics integration cluster-api-provider-openstack#100
- Metrics port change to support updated MAO metrics integration cluster-api-provider-baremetal#75

Including machineset and mhc in the metrics should not be disruptive.
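Since the serviceMonitor itself is deferred to #609, here is a rough sketch of what such an object generally looks like, purely for orientation. The name, namespace, label selector, port, and TLS settings below are illustrative assumptions, not the contents of #609:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: machine-api-controllers   # hypothetical name
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      k8s-app: controller         # must match the labels on the Service it scrapes
  endpoints:
    - port: machineset-mtrc       # the named Service port exposing the metrics
      scheme: https
      bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
      tlsConfig:
        insecureSkipVerify: true  # common with kube-rbac-proxy sidecars; tighten for production
```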
Force-pushed 72ea3b8 to 08bd94a
Can we please update the description of the PR to reflect that this is enabling the plumbing for metrics for the machineSet and MHC controllers, and the structure being used with the service/serviceMonitor and rbac-proxies? This is looking great though; generally speaking, the smaller the PR, the less friction you'll find. We could have got this working e2e for MHC and machineSet and then enabled the providers.
```go
}
if moTask.Info.State == types.TaskInfoStateError {
	metrics.RegisterFailedInstanceCreate(&metrics.MachineLabels{
		Name: r.machine.Name,
```
How are these metrics actually being exposed through this controller's metrics server? Wouldn't this need to call metrics.Registry.MustRegister(failedInstanceCreateCount) or something?
https://github.com/kubernetes-sigs/controller-runtime/blob/c0438568a706ec61de31b92f4d76e7fb7e1007b9/pkg/internal/controller/metrics/metrics.go#L50

This happens inside the init in https://github.com/openshift/machine-api-operator/pull/590/files#diff-7cbe8e056d62a2de30c7066e359bd9c9R68
Force-pushed 71dc77b to 028f0e9
elmiko left a comment:
@Danil-Grigorev we just merged the metrics doc for MAO; would you mind adding something about these new metrics to that doc as well?
https://github.com/openshift/machine-api-operator/blob/master/docs/dev/metrics.md
- Added vSphere api metrics
/test unit

1 similar comment

/test unit

/test unit

/retest Please review the full test history for this PR and help us cut down flakes.

1 similar comment

/retest Please review the full test history for this PR and help us cut down flakes.

/retest

/retest Please review the full test history for this PR and help us cut down flakes.

/retest

/retest Please review the full test history for this PR and help us cut down flakes.
@Danil-Grigorev: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
/retest Please review the full test history for this PR and help us cut down flakes.
Working on https://issues.redhat.com/browse/OCPCLOUD-784
This PR adds support for reporting the following Prometheus metrics, and also starts the controller-runtime metrics server to make them available to Prometheus servers:

- mapi_instance_create_failed: Total count of "create" cloud api errors
- mapi_instance_update_failed: Total count of "update" cloud api errors
- mapi_instance_delete_failed: Total count of "delete" cloud api errors

Labels on these metrics (for vSphere):