add per-tenant alertmanager metrics#2116

Closed
jtlisi wants to merge 6 commits into cortexproject:master from grafana:20200211_pertenant_am_metrics

Conversation


@jtlisi jtlisi commented Feb 12, 2020

What this PR does:

This PR takes advantage of the util.MetricFamiliesPerUser struct to provide per-tenant Alertmanager metrics.
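The core idea can be sketched with a toy aggregation (a minimal sketch; `perUserCounters` and `sumAcrossUsers` are illustrative names, not the actual `util.MetricFamiliesPerUser` API):

```go
package main

import "fmt"

// Minimal sketch of the per-tenant aggregation idea: each tenant's
// Alertmanager registry contributes values keyed by user ID, and the
// exporter can either keep them as per-user series or collapse them
// into one global total.
type perUserCounters map[string]float64

// sumAcrossUsers collapses per-tenant values into a single global value,
// the alternative to exporting one series per user.
func sumAcrossUsers(d perUserCounters) float64 {
	var total float64
	for _, v := range d {
		total += v
	}
	return total
}

func main() {
	d := perUserCounters{"tenant-a": 3, "tenant-b": 7}
	fmt.Println(sumAcrossUsers(d)) // 10
}
```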

Which issue(s) this PR fixes:
Fixes #1631

Checklist

  • Tests updated
  • CHANGELOG.md updated

Signed-off-by: Jacob Lisi <jacob.t.lisi@gmail.com>
@jtlisi jtlisi force-pushed the 20200211_pertenant_am_metrics branch from e6360aa to 004565d on February 12, 2020 at 18:21

@pracucci pracucci left a comment


All in all, good job. I think I've spotted a couple of issues. My main concern is metrics cardinality wherever we add the user label (in the past I've been told not to add even a single extra metric with the user label in the ingester, while here we add many). For this reason, I'd like to better understand whether we need the user label on every metric where it has been added, or whether some of them can be converted into global metrics instead of per-tenant ones.


Please add unit tests for this function.


@jtlisi jtlisi Feb 12, 2020


👍 I added a test that exercises this function and ensures it correctly sums metrics with multiple series.

```go
for user, userMetrics := range d {
	metricsPerLabelValue := getMetricsWithLabelNames(userMetrics[metric], labelNames)
	for _, mlv := range metricsPerLabelValue {
		for _, m := range mlv.metrics {
```

The function name is SendSumOfCountersPerUserWithLabels(), so if mlv.metrics contains more than one entry I would expect it to sum them and write only one metric to out. With the current logic, on the contrary, I think we will get clashing series (the exact same series reported multiple times), and that should be fixed. This case should be covered by a unit test too.
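The fix the review asks for can be sketched as follows: accumulate samples that share the same (user, label values) key before emitting, so no series is written twice. Types and names here are illustrative, not the actual Cortex code:

```go
package main

import "fmt"

// sample is a toy stand-in for one collected metric: its tenant, a
// canonical label-value key, and its value.
type sample struct {
	user   string
	labels string // e.g. "reason=timeout"
	value  float64
}

// sumPerUserWithLabels sums all samples sharing the same (user, labels)
// key, yielding exactly one output value per series instead of clashing
// duplicates.
func sumPerUserWithLabels(samples []sample) map[string]float64 {
	out := make(map[string]float64)
	for _, s := range samples {
		out[s.user+"/"+s.labels] += s.value // one entry per series key
	}
	return out
}

func main() {
	in := []sample{
		{"tenant-a", "reason=timeout", 2},
		{"tenant-a", "reason=timeout", 5}, // would clash without summing
		{"tenant-b", "reason=timeout", 1},
	}
	fmt.Println(sumPerUserWithLabels(in)["tenant-a/reason=timeout"]) // 7
}
```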

SendSumOfGaugesPerUserWithLabels


@jtlisi jtlisi Feb 12, 2020


👍 updated function and added a test

```go
	silencesPropagatedMessagesTotal *prometheus.Desc
}

func newAlertmanagerMetrics() *alertmanagerMetrics {
```

What's the impact, in terms of cardinality, of all the metrics with the user label? I'm wondering whether this change may lead to a series explosion.

For example, the metrics with the integration label can have 9 integrations, so the cardinality is 10x the number of tenants for each of those series (multiplied by the number of Alertmanager instances).
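The cardinality concern is multiplicative, which a back-of-the-envelope calculation makes concrete (the tenant, label-value, and instance counts below are made-up numbers for illustration, not figures from this PR):

```go
package main

import "fmt"

// seriesCount estimates how many series one per-user metric produces:
// every tenant gets one series per label value on every Alertmanager
// instance, so the counts multiply.
func seriesCount(tenants, labelValues, instances int) int {
	return tenants * labelValues * instances
}

func main() {
	// Hypothetical deployment: 1000 tenants, 9 integration label values,
	// 3 Alertmanager replicas.
	fmt.Println(seriesCount(1000, 9, 3)) // 27000 series for a single metric
}
```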


I pared down the number of metrics that report a user label and removed the integration label entirely.

Signed-off-by: Jacob Lisi <jacob.t.lisi@gmail.com>
@jtlisi jtlisi mentioned this pull request Feb 12, 2020

jtlisi commented Feb 12, 2020

Closing to switch to #2124, which is not sourced from the Grafana org; branches in that org are unable to run integration tests in CircleCI due to env variables set at the org level.

@jtlisi jtlisi closed this Feb 12, 2020


Development

Successfully merging this pull request may close these issues.

Tenanted Alertmanagers are not currently instrumented
