add per-tenant alertmanager metrics #2116
jtlisi wants to merge 6 commits into cortexproject:master from
Signed-off-by: Jacob Lisi <jacob.t.lisi@gmail.com>
pracucci
left a comment
All in all, good job. I think I've spotted a couple of issues. My main concern is metrics cardinality wherever we have the user label (i.e. in the past I've been told not to add even a single extra metric with the user label in the ingester, and here we have many). For this reason, I would like to better understand whether we need the user label on every metric where it has been added, or whether some of them can be converted into global metrics instead of per-tenant ones.
pkg/util/metrics_helper.go
Outdated
Please add unit tests on this function.
👍 I added a test that exercises this function and ensures it properly sums metrics with multiple series.
pkg/util/metrics_helper.go
Outdated
    for user, userMetrics := range d {
        metricsPerLabelValue := getMetricsWithLabelNames(userMetrics[metric], labelNames)
        for _, mlv := range metricsPerLabelValue {
            for _, m := range mlv.metrics {
The function name is SendSumOfCountersPerUserWithLabels(), so if there is more than one entry in mlv.metrics I would expect it to sum them and write only 1 metric to out. Instead, I think the current logic produces clashing series (the exact same series reported multiple times), which should be fixed. This case should be covered by a unit test too.
_SendSumOfGaugesPerUserWithLabels_
👍 updated function and added a test
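To illustrate the behavior requested in the review above, here is a minimal sketch of summing per-user series by label values so that each (user, label values) combination is emitted exactly once. It is not the code from this PR: the `sample` type, the `sumCountersPerUserWithLabels` function, and the `state` label are hypothetical stand-ins that avoid any dependency on the Prometheus client library.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// sample is a hypothetical stand-in for one counter series:
// its label set and its current value.
type sample struct {
	labels map[string]string
	value  float64
}

// sumCountersPerUserWithLabels groups each user's samples by the values of
// labelNames and sums every group, so exactly one value is produced per
// (user, label values) combination and no series is reported twice.
func sumCountersPerUserWithLabels(perUser map[string][]sample, labelNames []string) map[string]float64 {
	out := map[string]float64{}
	for user, samples := range perUser {
		for _, s := range samples {
			key := []string{user}
			for _, name := range labelNames {
				key = append(key, s.labels[name])
			}
			// Samples sharing the same key are accumulated, not re-emitted.
			out[strings.Join(key, "|")] += s.value
		}
	}
	return out
}

func main() {
	perUser := map[string][]sample{
		"user-1": {
			{labels: map[string]string{"state": "firing"}, value: 2},
			{labels: map[string]string{"state": "firing"}, value: 3},
			{labels: map[string]string{"state": "resolved"}, value: 1},
		},
		"user-2": {
			{labels: map[string]string{"state": "firing"}, value: 4},
		},
	}

	sums := sumCountersPerUserWithLabels(perUser, []string{"state"})

	// Sort keys so the output order is deterministic.
	keys := make([]string, 0, len(sums))
	for k := range sums {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%s = %v\n", k, sums[k])
	}
}
```

In this sketch the two `firing` samples for `user-1` collapse into a single summed entry, which is the case the reviewer asked to have covered by a unit test.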
        silencesPropagatedMessagesTotal *prometheus.Desc
    }

    func newAlertmanagerMetrics() *alertmanagerMetrics {
What's the impact, in terms of cardinality, of all the metrics with the user label? I'm wondering if this change may potentially lead to a series explosion.
For example, the metrics with the integration label can have 9 integrations, so the cardinality is roughly 10x the number of tenants for each such series (times the number of Alertmanager instances).
I pared down the number of metrics that report a user label and removed the integration label entirely.
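The cardinality concern in this thread can be made concrete with a back-of-the-envelope calculation. The sketch below is only an upper-bound estimate under assumed figures: the tenant count (1000) and instance count (3) are hypothetical, and `seriesCount` is an illustrative helper, not part of Cortex.

```go
package main

import "fmt"

// seriesCount estimates an upper bound on the series one metric can produce:
// one per combination of tenant, extra label value, and Alertmanager instance.
func seriesCount(tenants, labelValues, instances int) int {
	return tenants * labelValues * instances
}

func main() {
	// Hypothetical deployment: 1000 tenants, 3 Alertmanager instances.
	fmt.Println(seriesCount(1000, 1, 3)) // user label only: 3000
	fmt.Println(seriesCount(1000, 9, 3)) // user + 9 integrations: 27000
}
```

Dropping the integration label, as done in the reply above, brings each affected metric back to the first case.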
Closing to switch to #2124, which is not sourced from the Grafana org; the Grafana org is unable to run integration tests in CircleCI due to env variables set at the org level.
What this PR does:
This PR takes advantage of the util.MetricFamiliesPerUser struct to provide per-tenant Alertmanager metrics.
Which issue(s) this PR fixes:
Fixes #1631
Checklist
CHANGELOG.md updated