Metrics cardinality is too high #11248

@skonto

Description

/area monitoring

Keeping metric cardinality low is a standard principle when using Prometheus and a key concept in monitoring in general. Low cardinality is a key design principle in the latest standards too.
Although Prometheus has taken steps in the past to make things more flexible, when it comes to memory, current versions force you to limit the number of ingested time series in order to tune memory implicitly. The memory consumption that Knative metrics impose is not low. One way to estimate the memory is to use this calculator.
Right now many of our metrics use the revision name, config name, pod name, and namespace name as labels.
To mention a few examples:
The activator (request_latencies) and autoscaler (reconciler) time series have a cardinality of: #histogram_buckets * #revision * #ns
The webhook emits similar histogram metrics, and their cardinality depends on the number of kinds and namespaces.
To understand the scale: if we use 30 buckets (aggregated from several histograms), 100 services, and 50 namespaces, this means 150K time series from one pod.
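The arithmetic above can be checked with a small sketch. The function name `series_count` is made up for illustration; it only multiplies the three factors from the formula in this issue and treats the result as an upper bound for a single pod.

```python
def series_count(buckets: int, revisions: int, namespaces: int) -> int:
    """Upper bound on time series emitted by one pod.

    Mirrors the formula from this issue:
    #histogram_buckets * #revision * #ns
    """
    return buckets * revisions * namespaces


# Figures from the example: 30 buckets, 100 services, 50 namespaces.
per_pod = series_count(buckets=30, revisions=100, namespaces=50)
print(per_pod)  # 150000
```

Because the factors multiply, doubling any one of them doubles the number of series.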
We have several pods, and this does not yet include Eventing, where cardinality is also high due to labels such as event_type and filter_type.
According to the calculator above, 1M time series needs around 4GB of memory under specific assumptions. Given the number of pods we run, we can easily reach that number. We already face this downstream.
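As a rough sketch of how quickly that number is reached, the snippet below scales the "1M series is about 4GB" data point linearly. Linear scaling is an assumption made here for illustration, not something the calculator guarantees, and the function names are hypothetical.

```python
import math

# Reference point quoted in this issue: ~4 GB for 1M time series.
REFERENCE_SERIES = 1_000_000
REFERENCE_GB = 4


def estimated_memory_gb(series: int) -> float:
    """Linear extrapolation from the reference point (assumption)."""
    return series * REFERENCE_GB / REFERENCE_SERIES


def pods_to_reach(target_series: int, series_per_pod: int) -> int:
    """Number of pods needed to emit at least target_series."""
    return math.ceil(target_series / series_per_pod)


series_per_pod = 150_000  # from the example: 30 * 100 * 50
print(estimated_memory_gb(series_per_pod))               # 0.6
print(pods_to_reach(REFERENCE_SERIES, series_per_pod))   # 7
```

Under these assumptions, seven pods each emitting 150K series are enough to pass the 1M mark.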
Here is a sample status report for the top series on Prometheus when using 100 services:
[image: Prometheus status report showing the top series]

Also note that we haven't taken into consideration the scenario where a pod name changes due to a restart (which can happen easily). A Prometheus instance is not meant to serve only Knative, so in general we should tune our metrics API. I propose we limit our labels to the namespace level rather than the revision level.
Logging, not metrics, should be used to understand the behavior of individual services. We also need to reconsider histograms for the webhook and controller cases, since buckets make cardinality explode.
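To illustrate the proposal, here is a minimal sketch that counts distinct label sets before and after dropping the per-revision label. The label names (`namespace_name`, `revision_name`, `le`) and the small sizes are hypothetical and chosen only to keep the example readable; this is not how Knative or Prometheus counts series internally.

```python
from itertools import product


def distinct_series(samples, keep_labels):
    """Count distinct label sets when only keep_labels are retained."""
    return len({tuple(s[k] for k in keep_labels) for s in samples})


# Hypothetical label values: 5 namespaces, 10 revisions, 3 buckets.
namespaces = [f"ns-{i}" for i in range(5)]
revisions = [f"rev-{i}" for i in range(10)]
buckets = ["0.1", "1", "+Inf"]

samples = [
    {"namespace_name": ns, "revision_name": rev, "le": le}
    for ns, rev, le in product(namespaces, revisions, buckets)
]

per_revision = distinct_series(samples, ["namespace_name", "revision_name", "le"])
per_namespace = distinct_series(samples, ["namespace_name", "le"])
print(per_revision, per_namespace)  # 150 15
```

Dropping the revision label removes one whole factor from the product, so the reduction grows with the number of revisions.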

What version of Knative?

All versions

Expected Behavior

Metrics should have low cardinality.

Actual Behavior

An excessive number of time series is created.

Steps to Reproduce the Problem

Create a moderate number of namespaces and ksvcs.

/cc @evankanderson @mattmoor @markusthoemmes

Metadata

Assignees

No one assigned

    Labels

    area/monitoring, kind/bug, lifecycle/stale
