Re-enable metrics, add metrics for vSphere #590
openshift-merge-robot merged 1 commit into openshift:master
Conversation
Skipping CI for Draft Pull Request.
Force-pushed c4165e0 to 9e1f8af
JoelSpeed left a comment:
Added some comments as I read through; let me know if you have any questions about them.
```go
prometheus.CounterOpts{
	Name: "failed_machine_sets_total",
	Help: "Number of times machine set provisioning has failed.",
}, []string{"name", "namespace", "timestamp", "reason"},
```
I'm not sure we should include the timestamp; does it add any extra information for the user, do you think?

Doesn't look like it, unless we want an additional unique key for tracking successes or failures of identically named machine sets, which sounds like too much.

Prometheus will allow you to see this anyway, as it will track the value over time. So you can use a query to work out when the value changed and get a spike on the graph when it was incremented. Good thought though! Definitely the right line of thinking.
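To make the point concrete, recovering the timing from Prometheus itself can be done with a range-vector query. Assuming the `failed_machine_sets_total` counter from the snippet in this thread, something like the following shows spikes whenever failures occur, with no timestamp label needed (the 5-minute window and the namespace value are illustrative choices, not from this PR):

```promql
# Failures observed per 5-minute window; non-zero points mark when the counter moved.
increase(failed_machine_sets_total{namespace="openshift-machine-api"}[5m])
```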
```go
prometheus.CounterOpts{
	Name: "succeeded_machine_sets_total",
	Help: "Number of times machine set provisioning has succeeded.",
}, []string{"name", "namespace", "timestamp"},
```
Similarly, I don't think the timestamp adds anything?
```go
}, []string{"name", "namespace", "timestamp", "reason"},
)

// SucceededMachineSetProvisionCount calculates the number of success provisioning for a machine
```

Suggested change:
```diff
-// SucceededMachineSetProvisionCount calculates the number of success provisioning for a machine
+// SucceededMachineSetProvisionCount calculates the number of success provisioning for a MachineSet
```
Force-pushed 9e1f8af to 077974a
Force-pushed 077974a to 5df6dd4
/retest

1 similar comment

/retest
Force-pushed 5df6dd4 to ffe4a58
Force-pushed 8ad2615 to d85718a
JoelSpeed left a comment:
I realise there's some tidy-up to do; I've added comments where there is, so we don't forget them before merging. Added a couple of suggestions, but otherwise this is looking pretty good.
Force-pushed d85718a to 0a0b736
Force-pushed 9430cb2 to 37c97ca
/approve
Force-pushed 561a9a9 to 8bb5059
```yaml
- name: machineset-mtrc
  targetPort: machineset-mtrc
  port: 8442
- name: nodelink-mtrc
```
Do we think there is any value we can get out of exposing metrics for the nodelink controller? I'd expect us to eventually drop it and embed that knowledge into the machine controller.

I think that dropping those metrics and the nodelink removal could be done at the same time, WDYT?

I can't see any value in conflating that here. If you agree there's not much value in exposing the nodelink metrics, let's just undo the changes that expose the controller, so we alleviate the operational complexity introduced by this PR. Dropping the nodelink controller is a completely different discussion that does not need to happen in this PR.
```go
}

const (
	VSphereProvider string = "vsphere"
```
Why do we need this? The platform is implicit in any running cluster.

I was assuming the failure rates will be collected across clusters and aggregated by that value in the longer term?

These metrics are only exposed at each individual cluster's scope. If anything were to aggregate them in the future, we would consider introducing any further suitable labels by then.

Would we not want to have this label if we were bringing telemetry back to RH about our controllers?

That's my idea too. It may be a better idea to embed the provider-name resolution into MAO, so it would identify the provider from the infrastructure config and set it before sending metrics, @enxebre @JoelSpeed?

> Would we not want to have this label if we were bringing telemetry back to RH about our controllers?

If we do, we can discuss it when there's a proposal for exposing machine API metrics for telemetry. As of today this property does not provide any value for the consumers of these metrics, i.e. the users owning the cluster. Most of the information around machine usage is already exposed as metrics by MAO: you can get lists of failed machines, and we could possibly group by error message. This is complementary.
```yaml
    k8s-app: machine-api-operator
  sessionAffinity: None
---
apiVersion: v1
```
Where is the serviceMonitor that goes with this?

It can't be merged until the provider integration PRs get into the repos, or the CI would fail as the machine metrics won't be accessible. The serviceMonitor is therefore in #609.

> or the CI would fail as the machine metrics won't be accessible

Can you elaborate? Can you point me to the origin test that will fail?

Have a look at the `Alerts shouldn't report any alerts in firing state...` test in every provider; sample: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/590/pull-ci-openshift-machine-api-operator-master-e2e-aws/2496

It's not the service which is triggering it, but the serviceMonitor from the second PR, #609. I included the merge order in the Jira issue and posted it in Slack too. While the provider PRs are not merged, the metrics port is not served from the code, and while Prometheus is trying to connect, the machine metrics respond with 502. That results in an alert in the openshift-monitoring namespace. Joel and I already discussed the issue in Slack and decided to split the PR instead of revendoring the MAO PR branch in every provider.

I thought Prometheus was not triggering an alert for that scenario. This would need to account for OpenStack and baremetal. There's nothing stopping us from including the serviceMonitor for machineset and mhc in this PR and watching it work, right?

I've opened issues in both repos to make sure this won't be left unanswered:
- Enable metrics to support updated MAO metrics integration cluster-api-provider-openstack#100
- Metrics port change to support updated MAO metrics integration cluster-api-provider-baremetal#75

Including machineset and mhc in the metrics should not be disruptive.
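Since the serviceMonitor itself is deferred to #609, here is a rough sketch of what such an object generally looks like, purely for orientation. The name, namespace, label selector, port, and TLS settings below are illustrative assumptions, not the contents of #609:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: machine-api-controllers   # hypothetical name
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      k8s-app: controller         # must match the labels on the Service it scrapes
  endpoints:
    - port: machineset-mtrc       # the named Service port exposing the metrics
      scheme: https
      bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
      tlsConfig:
        insecureSkipVerify: true  # common with kube-rbac-proxy sidecars; tighten for production
```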
Force-pushed 72ea3b8 to 08bd94a
Can we please update the description of the PR to reflect that this is enabling the plumbing for metrics for the machineSet and MHC controllers, and the structure being used with the service/serviceMonitor and rbac-proxies? This is looking great though; generally speaking, the smaller the PR, the less friction you'll find. We could have got this working e2e for MHC and machineSet and then enabled the providers.
```go
}
if moTask.Info.State == types.TaskInfoStateError {
	metrics.RegisterFailedInstanceCreate(&metrics.MachineLabels{
		Name: r.machine.Name,
```
How are these metrics actually being exposed through this controller's metrics server? Wouldn't this need to call metrics.Registry.MustRegister(failedInstanceCreateCount) or something?
https://github.com/kubernetes-sigs/controller-runtime/blob/c0438568a706ec61de31b92f4d76e7fb7e1007b9/pkg/internal/controller/metrics/metrics.go#L50

This happens inside the init in https://github.com/openshift/machine-api-operator/pull/590/files#diff-7cbe8e056d62a2de30c7066e359bd9c9R68
Force-pushed 71dc77b to 028f0e9
elmiko left a comment:
@Danil-Grigorev we just merged the metrics doc for MAO; would you mind adding something about these new metrics to that doc as well?
https://github.com/openshift/machine-api-operator/blob/master/docs/dev/metrics.md
- Added vSphere api metrics
/test unit

1 similar comment

/test unit

/test unit

/retest Please review the full test history for this PR and help us cut down flakes.

1 similar comment

/retest Please review the full test history for this PR and help us cut down flakes.

/retest

/retest Please review the full test history for this PR and help us cut down flakes.

/retest

/retest Please review the full test history for this PR and help us cut down flakes.
@Danil-Grigorev: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
/retest Please review the full test history for this PR and help us cut down flakes.
Working on https://issues.redhat.com/browse/OCPCLOUD-784
This PR adds support for reporting the following Prometheus metrics, and also starts the controller-runtime metrics server to make them available to Prometheus servers:

- mapi_instance_create_failed: Total count of "create" cloud api errors
- mapi_instance_update_failed: Total count of "update" cloud api errors
- mapi_instance_delete_failed: Total count of "delete" cloud api errors

Labels on these metrics (for vSphere):